Sequence-based prediction of physicochemical interactions at protein functional sites using a function-and-interaction-annotated domain profile database

Sequence-based prediction of physicochemical interactions at protein functional sites using a... Background: Identifying protein functional sites (PFSs) and, particularly, the physicochemical interactions at these sites is critical to understanding protein functions and the biochemical reactions involved. Several knowledge-based methods have been developed for the prediction of PFSs; however, accurate methods for predicting the physicochemical interactions associated with PFSs are still lacking. Results: In this paper, we present a sequence-based method for the prediction of physicochemical interactions at PFSs. The method is based on a functional site and physicochemical interaction-annotated domain profile database, called fiDPD, which was built using protein domains found in the Protein Data Bank. This method was applied to 13 target proteins from the very recent Critical Assessment of Structure Prediction (CASP10/11), and our calculations gave a Matthews correlation coefficient (MCC) value of 0.66 for PFS prediction and an 80% recall in the prediction of the associated physicochemical interactions. Conclusions: Our results show that, in addition to the PFSs, the physical interactions at these sites are also conserved in the evolution of proteins. This work provides a valuable sequence-based tool for rational drug design and side-effect assessment. The methodisfreelyavailable andcan be accessedat http://202.119.249.49. Keywords: Physicochemical interaction prediction, Protein functional site prediction, fiDPD, Hidden Markov model, Domain profile module Background physicochemical interactions, which is indispensable in- Most proteins perform biological functions via interactions formation for understanding protein biochemical reactions. with their partners, such as small molecules or ligands, Together with PFS prediction, accurate protein-ligand DNA/RNA, and other proteins, forming instantaneous or interaction (PLI) prediction opens up a new dimension in permanent complex structures. Of particular importance is correctly annotating protein function and thus provides that only a few pivotal amino acids on a protein’s surface, valuable information for rational drug design and drug usually called protein functional sites (PFSs), play key roles side-effect assessment [1–3]. To date, 3D protein-partner in determining these interactions. Thus, understanding complex structures have been the main source of know- protein functions depends upon accurate predictions of ledge about PFSs and PLIs. In recent years, in silico PFSs. However, PFSs alone do not reveal the details of their methods have received increasing attention as an alterna- tive strategy for protein function annotation, especially in * Correspondence: dming@njtech.edu.cn predicting PFSs. The advantage of these methods stems College of Biotechnology and Pharmaceutical Engineering, Nanjing Tech from two factors: the rapid accumulation of a large number University, Biotech Building Room B1-404, 30 South Puzhu Road, Jiangsu 211816 Nanjing, People’s Republic of China of complex 3D structures in publicly accessible databases Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Han et al. BMC Bioinformatics (2018) 19:204 Page 2 of 12 such as the Protein Data Bank (PDB) [4]and therapid the prediction. A profile hidden Markov model of the development of computer technology and computation HMMER program was used in the prediction to search a algorithms. module member of the database for a given protein. We In the last few decades, many computational methods applied the fiDPD method to 10 target proteins of CASP10 have emerged to identify PFSs from protein structures and [48] and CASP11 [49] and found that the method has a sequences [5]. Most sequence-based methods assume that Matthews correlation coefficient (MCC) value of 0.66 for functionally important residues are conserved through PFS prediction. Additionally, the model provided a cor- evolution and can be identified as conserved sites based rect physicochemical interaction prediction for 80% of on multiple sequence alignment (MSA) within homolo- theexaminedsites.Weexpect the present method to gous protein families [6–8]. Sequence-based information be a valuable auxiliary tool for conventional bioinformatic such as secondary structure propensity and the likely and protein function annotations. solvent accessible surface area (SASA) have also been used to improve the prediction [9–12]. In addition, Methods structure-based methods that essentially determine local Figure 1 shows the flow chart used to build fiDPD. We or overall structural similarity have been developed for first introduced the fDPD as a list of representative profile PFS prediction [13–16]. Typical local structural features modules built by sorting out structure-and-sequence include large clefts on protein surfaces [17, 18], special similar protein domains in the SCOP databases [50]. spatial arrangements of catalytic residues [19–21], and Next, PFSs and atomic patterns of PLIs were derived particular patterns between surface residues [22, 23]. from known protein-ligand-complex structures in the Other prediction methods have used both structural and PDB; then, after a series of site-to-site mappings, these sequence information [24, 25] and might, when combined structures were used to annotate fDPD profile modules with artificial intelligence techniques, provide encouraging and thus to build the fiDPD. results [26–28]. Other methods based on protein dynam- ics [29–34], conventional molecular dynamics and dock- fDPD was prepared based on the subgroup classification ing simulations [35–37]havealsobeen successfulinPSF of domain entries of the SCOP database prediction. To elucidate the physicochemical interactions We started with a modified classification of protein do- between proteins and their partners, particularly those be- main structures collected in the SCOP database [50, 51]. tween protein and ligands, researchers have attempted to In SCOP, a large protein structure is often manually di- characterize these interactions as early as the emergence vided into a few smaller parts or domains according to of the first protein-ligand complex structure. However, their spatial arrangement within the protein. A recent only very recently have structural bioinformatic tools version of SCOPe 2.05 was downloaded from http:// emerged with which to systematically characterize pro- scop.berkeley.edu/references/ver=2.05, which includes tein-ligand interactions (PLIs) [38–43] due to the rapid ac- 214,547 domain entries extracted from 75,226 protein cumulation of protein complex structures. Additionally, a structures in the PDB. In SCOP, these domain structures few databases record detailed atomic interactions be- are arranged in a hierarchical 7-level system—Class (cl), tween proteins and ligands, facilitating PLI studies Fold (cf), Superfamily (sf), Family (fa), Protein Domain [44–46]. These data provide new resources for the (dm), Species (sp), and PDB code identity (px)—according large-scale characterization of physicochemical inter- to their sequence, function and structure similarity. Spe- actions between proteins and their partners and have cifically, those domains listed in a given domain entry helped improve conventional docking simulation and (dm) presumably share the same class, fold, superfamily pharmacology research. Several knowledge-based or and protein family but might differ in species and PDB ab initio methods have been developed for the prediction code entry. Theoretically, PFSs are more likely to be of PFSs; however, an accurate method for predicting conserved when they share both higher structural and the physicochemical interactions associated with PFSs sequential similarity, and this assumption forms the basis is still lacking [47]. for our algorithm of fiDPD in the prediction of PFSs In this paper, we develop a new method for predicting and PLIs. Using a profile hidden Markov model of the physical interactions occurring on functional sites based HMMER program, the MSA of all the domains within on the amino acid sequences of given proteins. This the same dm entry gives a single representative profile sequence-based method first predicts PFSs from a func- module. In this way, 12,527 representative profile mod- tional site-annotated domain profile database, or fDPD, ules were created for all the dm entries, forming the and then assigns the types of interactions most likely to basis of fDPD and fiDPD. appear at the predicted sites. In this study, we derived a In building fDPD, it is important for protein domains functional site- and interaction-annotated domain profile within the same dm entry to be structurally and sequen- database, called fiDPD, which plays the primary role in tially close to one another. However, a quick calculation Han et al. BMC Bioinformatics (2018) 19:204 Page 3 of 12 Fig. 1 Flow-chart for building the function-site- and interaction-annotated domain profile database (fiDPD) and for predicting protein function- sites and PLIs using fiDSPD reveals that the C root-mean-square-distance (RMSD) MUSCLE [53], from which a profile module was then can be as large as 12 Å for many domain structures built using the hmmbuild module of the HMMER pro- listed in the same dm entry. This result indicates that gram (http://hmmer.org/ [54]). A profile module is a se- there are many domains listed in the same dm entry of quence of hypothetical amino acids, which is, instead of SCOPe 2.05 that have quite different structures, which conventional amino acids, probably a mixture of certain makes the profile modules of fDPD less representative of amino acids according to the MSA of the subgroup. For member proteins within the dm entry. To reduce the each individual position in a profile module, we defined difference, we divided the domains within a dm entry a conservation value C according to the MSA. We into a few smaller groups or subgroups so that selected assigned the C value as 0, 1, 3, or 4 for a position being domains within the same subgroup would have mutual nonconservative, minimally conservative, conservative C -RMSD < 7 Å and a mutual sequence similarity > 10 and highly conservative, as indicated respectively by a (a score calculated by the MSA program CLUSTALW gap, “+” symbol, a lowercase letter or a capital letter in [52]). Thus, derived subgroups then replace the dm the MUSCLE alignment. We also defined an overall vol- entry as the basic unit of fDPD. fDPD contains 16,559 ume value N for a profile module as the number of pro- subgroups, which is 32% more than the original SCOP tein domains listed in the subgroup: a larger N value dm entries, with approximately 12 member structures in usually indicates that more information is available for each subgroup, on average. that subgroup and thus a greater confidence on the annotation. fDPD is composed of functional site annotated protein A scoring function S was assigned to each position in profile modules based on multiple subgroup-protein an fDPD profile module to mark its propensity of being sequence alignment a functional site. To this end, we first mapped known In fDPD, sequences of protein domains in a subgroup functional sites of member proteins within the same were extracted and aligned using the MSA program subgroup to the profile module according to the MSA Han et al. BMC Bioinformatics (2018) 19:204 Page 4 of 12 To annotate the profile modules with PLIs, atomic interaction patterns between the protein and ligand were initially determined based on their 3D protein-ligand complex structures. Specifically, the atomic 3D coordi- nates of amino acids listed in PDB SITE sections and those of ligand molecules were filtered out from the PDB files; then, a series of atomic distances (d)werecalculated between PFSs (A ) and ligands (A ). Finally, a few Site Ligand types of bonding and nonbonding interactions for each A were determined based on the pairwise distances and Site Fig. 2 Mapping known protein function sites and interactions to a the biochemical properties of involved amino acids. domain-profile module, ⊗: known PFSs of domain structures, ⊙: pivotal PFSs in a profile module with the number indicating a H-bond weight factor, *: PFSs mapped into the query protein sequence from Almost all PLIs occur in aqueous environments, where profile module pivotal sites, which, after a filtering, is reduced to two water molecules play a critical role. As a result, hydrogen points (A and B) as a final prediction output, Δ: non-conservative pivotal sites mapped into the query protein, which will be ignored bonds might be consistently established and destroyed due to the low conservation value until a certain stable protein-ligand configuration is achieved. Here, we have calculated hydrogen bonds within (see Fig. 2). Functional sites of member proteins were the protein-ligand complex using the program HBPLUS collected from the SITE sections of the corresponding [58]. The program determines H-bond donor (D) and ac- PDB file. Of the 202,705 protein domains listed in ceptor (A) atom pairs based on a nonhydrogen atom con- SCOPe, 132,725 domain structures have a total of figuration using a maximum H–A distance of 2.5 Å, a 1,878,004 functional sites annotated in PDB SITE re- maximum D–A distance of 3.9 Å, a minimum D–H–A cords. Then, for simplicity, we assigned S as the total hit angle of 90° and a minimum H–A–AA angle of 90°, where number that a profile module position received based on H is the theoretical hydrogen atom and AA is the atom of the MSA. Thus, the larger a position’s S-value, the more functional sites in the H-bond acceptor. In this way, we likely it is to be a hypothetical functional site for the defined NHBA and NHBD as the total number of H-bond profile module. In this way, the profile modules were an- acceptors and H-bond donors, respectively, associated notated with known PFSs, and we called the database with atoms in a given functional site. composed of these profile modules the function-site- annotated domain profile database, or fDPD. Previously, Electrostatic interactions alternative functional site annotations for profile modules Electrostatic force plays important roles in many PLIs were also built by using different “known” PFSs derived and might be the main driving force to initiate catalytic from FDPA calculations instead of those recorded active reactions, to guide the recognition between protein and sites in the PDB database [55]. Compared with the dm en- ligand, and so on [59–61]. However, accurately deter- tries in the original SCOP, in fDPD, PFSs should be more mining atomic charges in bio-structure is a very challen- likely to be conserved since they share both higher struc- ging task since it is highly sensitive to the surrounding tural and higher sequential similarity. environment. Here, for simplicity, we identified electro- static interactions simply by examining the charging status of contact atoms in PLIs. Specifically, we first se- fiDPD was built by attaching physicochemical interaction lected positively charged nitrogen (N) atoms of func- annotations to functional sites in fDPD profile modules tionalsitesof Arg,His,andLysand then determined Obviously, the abovementioned S-value is heavily an electrostatic interaction if there a neighboring (< 4.5 Å) dependent on the means by which the “known” PFSs oxygen atom was present in the ligand, which is not part were determined. In this work, S-values are determined by of a cyclized structure. An electrostatic interaction was using only PDB SITE information, which, in most cases, also built when a negatively charged oxygen (O) atom is composed of manually prepared ligand-binding sites. from Asp and Glu residues was found near a ligand Other types of biologically relevant functional site data, nitrogen atom. We used NELE as the total number of such as enzyme active sites [56] and phosphorylation sites electrostatic interactions involving atoms in a given func- [57], might also be used in the annotation. Here, consider- tional site. ing the importance of PLIs in determining protein func- tion, we added PLI annotations to the profile modules of π-stacking interactions fDPD to build the function-site and interaction-annotated π-Stacking interactions play a critical role in orientating domain profile database, or fiDPD. ligands inside binding pockets. We first identified the Han et al. BMC Bioinformatics (2018) 19:204 Page 5 of 12 aromatic side chains of Trp, Phe, Tyr and His of PFSs different pivotal sites of the profile module according to and carbon-dominant cyclized structures of ligands. Usu- the MSA of the studied subgroup. As a result, each ally, aromatic rings form an effective π-stacking inter- fDPD profile module was annotated with interaction action when they get close enough (4.5–7 Å) and have vectors V on hypothetical functional sites, thus forming either a parallel or perpendicular orientation [62, 63]. the fiDPD. Here, for simplicity, we defined a π-stacking interaction if we could find three or more distinct heavy-atom pairs be- fiDPD predicts both functional sites and PLIs using a tween atoms from the aromatic ring of a given functional hidden Markov model site and those from ligand carbon-ring structures. We de- fiDPD is essentially a list of profile module entries anno- fined the total number of π-stacking interactions involving tated with domain functional sites and PLIs. In fiDPD, a given functional site as NPI. two steps are required to predict the hypothetical func- tional sites and involved PLIs for a given inquiry protein: Van der Waals interaction 1) identifying profile modules in fiDPD that match the A Van der Waals interaction is formed when the distance query sequence best and 2) interpreting pivotal func- d between a nonhydrogen atom of protein functional site tional sites and associated PLIs of the matched profile and a nonhydrogen atom of ligands satisfies the following modules as a prediction of PFSs and PLIs for the query inequality: protein based on certain statistical evaluations. In the first step, fiDPD scans the query sequence d < vdWðÞ A þ vdW A þ 0:5Å; Site Ligand against all its module entries using the SCAN module of the HMMER program [67]. The scan usually gives a where vdW(A) is the Van der Walls radius of atom A couple of profile modules within an alignment E-value − 5 and no covalent bond, coordination bond, hydrogen cutoff no greater than 1 × 10 . Each alignment (indexed bond, electrostatic force or π-stacking interaction is by superscript j in Eq. (1)) is assigned a scoring function found between them. A similar definition of the Van der E as the negative logarithm of the E-value score. Due to Waals interaction was also used by Kurgan and colleagues the limited volume of known protein sequences contained in their study of protein-small ligand interaction patterns in fiDPD, there are cases in which HMMER SCAN cannot [38] and by Ma and colleagues in their study of protein- find any match for the query protein, and for these cases, protein interactions [64]. The atomic Van der Waals radii fiDPD simply gives a notice of “no-hit.” In step 2), we de- were taken from the CHARMM22 force field [65]. Each fined a scoring function F for the ith residue of the query functional site was assigned an NVDW value as the total protein as its propensity to be a functional site: number of Van der Waals interactions involving atoms of j j j j this site. F ¼ S C N E ð1Þ 0 0 i i Covalent bond and coordinate bond where the summation runs over all the alignments j and i Usually, nonbonded forces dominate interactions between stands for the position of the profile module that matches a ligand and its target protein; however, irreversible cova- the ith residue of the query protein. Residues with a high- lent bonds are also found in PLIs when a tight and steady valued F-scoring function will be predicted as hypothetical connection between the ligand and receptor is essential to functional sites. the biological function, such as in the rhodopsin system One way to determine high-F-valued sites for a query [66]. A covalent bond is formed if the distance between a protein is to simply choose a certain number (n)of nonhydrogen atom from a functional site and a nonhy- top-valued residues, called n-top selection. This method drogen atom from ligand satisfies d < RðA Þþ RðA Þ Site Ligand has been used for enzyme catalytic site prediction [55] þ0:5 Å,where R(A) is the radius of atom A. For metal-ion since experimentally determined enzyme active sites ligands, this condition also defines coordinate bonds be- have a relatively fixed number as revealed by the Catalytic tween metal ions and PFSs. Usually, in coordinate bonds, Site Atlas (CSA) dataset [56]. Another method to select the shared electrons are present in atoms with higher elec- top-valued residues uses a cutoff percentage that was tronegativity in a functional site. We denoted NCOV as the proved to be efficient in a previous ligand-binding site total number of covalent bonds involving atoms in the func- prediction study [32, 34]. In this method, we first filtered tional site and NCOO as the total number of coordinate out those low-valued noise-like residues whose F-scores bonds involving atoms in that site. were smaller than a cutoff percentage M%ofthe max- We characterized a PLI between a PFS and the ligands imum F-value F ; then, for the remaining residues, max with a 7-dimensional interaction vector V = (NCOV, the top T% were predicted as hypothetical functional NCOO, NHBA, NHBD, NPI, NELE, NVDW). The inter- sites of the query protein. Usually, this selection strategy action vectors of all member proteins were summed in tends to give a greater prediction function for larger Han et al. BMC Bioinformatics (2018) 19:204 Page 6 of 12 proteins. We used this selection strategy to predict PFSs TP  TN−FP  FN MCC ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi : in the remainder of this paper. The server is freely avail- ðÞ TP þ FP ∙ðÞ TP þ FN ∙ðÞ TN þ FP ∙ðÞ TN þ FN able and can be accessed at http://202.119.249.49.For clarity, F-scores are renormalized to a 1–100 range for The predicted PLIs were compared with those directly predicted sites. derived from 3D protein-ligand complex structures, and To predict PLIs, we defined a protein-ligand interaction precision and recall values were obtained to qualify PLI scoring-vector function I ={NCOV,NCOO,NHBA,NHBD , predictions. i i i i i NPI , NELE ,NVDW}for the ith residue of the query i i i protein following Eq. (1): Results and discussion X The mimivirus sulfhydryl oxidase R596 j j j j I ¼ N E C V ð2Þ 0 0 The 292aa mimivirus sulfhydryl oxidase R596 is target i i T0737 of CASP10, whose structure was later deter- j j j j j j where V ¼fNCOV ; NCOO ; NHBA ; NHBD ; NPI ; 0 0 0 0 0 0 i i i i i i mined at 2.21 Å (PDB entry 3TD7; see Fig. 3 [73]). The j j NELE ; NVDW g is the PLI vector for residue i in the 0 0 protein is composed of two all alpha-helix domains: the i i profile module j that matches the ith residue of the query N-terminal sulfhydryl oxidase domain (Erv domain) and sequence. For each prediction functional site, fiDPD will the C-terminal ORFan domain. The mimivirus enzyme determine an associated PLI vector according to Eq. (2), R596 has an EC number of EC1.8.3.2, catalyzing the for- which identifies the interactions involved with each pre- mation of disulfide bonds through an oxidation reaction diction site. For clarity, in the webserver, when I has a with the help of a cofactor of flavin adenine dinucleotide nonzero value from Eq. (2), it will be simply assigned as (FAD). FAD is tightly bonded to 22 residues in the cata- “1” to indicate a certain type of PLI. lytic pocket in the Erv domain [48], playing an important role in transferring electrons from a 10 Å distance shuttle Validation datasets disulfide in the flexible interdomain loop to the active-site The original fDPD was examined for PFS prediction using disulfide close to FAD in the Erv domain [73]. In the a few types of datasets, including two manually culti- prediction, fiDPD scanned the T0737 sequence against vated enzyme catalytic site datasets of the 140-enzyme thedatabaseand found4profilemoduleentries,all CATRES-FAM [68], the 94-enzyme Catalytic Site Atlas from the Apolipoprotein family with a structure of a (CSA-FAM) [56] and a 30-member small-molecular four-helical up-and-down bundle. The 4 entries include binding protein target from CSAP9 [69]. Here, we exam- an automated-match-domain profile built from 10 sequences ined fiDPD by calculating the PLIs of protein targets listed from Arabidopsis thaliana, a second automated-match- in CASP10 [70]and in CASP11 [49], whose ligand-binding domain profile built from 4 sequences from Rattus nor- complex structures had been solved. vegicus, an augmenter of liver regeneration domain profile built from 13 sequences from Rattus norvegicus, Validation method and a thiol-oxidase Erv2p domain profile built from 6 The conventional prediction precision and recall calcula- sequences from Saccharomyces cerevisiae. The scanning − 8 − 19 tions were used to evaluate the performance of our method: E-value ranges from 2 × 10 to 1 × 10 , indicating Precision = TP/(TP + FP) and Recall = TP/(TP + FN), where that the query sequence only has moderate similarity thetruepositives (TPs)are the predicted residues listed as with the annotated sequences in the database. A total functional sites in the dataset, the false positives (FPs) are of 56 annotated pivotal sites in the 4 fiDPD profile the predicted sites not listed in the dataset, and the false modules were then collected and sorted according to negatives (FNs) are the functional sites listed in the dataset their functional site scoring functions. When mapping but missed by the method. Another relevant quantity is the to the query sequence, 12 functional sites were then true negative (TN), which stands for the correctly predicted automatically identified, resulting in a 92% prediction nonbinding/nonfunctional site residues. In our calculations, precision and 57% recall. We also examined those func- the statistics did not take account of the “no-hit” predic- tional sites that fiDPD failed to identify and found that tions. The overall precision is the sum of all the TPs divided they are located in a different C-terminal domain than by the total number of predicted residues, and the overall the four-helical up-and-down bundle domain. recall is the sum of all the TPs divided by the total number To examine the PLI prediction, we first collected inter- of listed functional sites in the dataset. The precision-recall action scoring vectors associated with pivotal sites in the curve was found to be slightly dependent on the cutoff four profile modules according Eq. (2) and then compared percentage M% and T% in the selection method. The with those directly determined from the protein-ligand MCC [71] was used to assess the ligand-binding residue complex structure recorded in PDB entry 3TD7 (Table 1). predictions of the CASP10 target proteins [72]and is Figure 3 demonstrates key interactions predicted by defined as follows: Eq. (2) and those not found by the prediction. fiDPD Han et al. BMC Bioinformatics (2018) 19:204 Page 7 of 12 Fig. 3 Mapping the protein-ligand interactions predicted for the mimivirus sulfhydryl oxidase R596, target T0737, PDB code 3TD7. Dash lines represent PLIs, they are colored as following: blue for electrostatic interactions, green for π-stacking interactions, gray for van der Waals interactions, and red for interaction not found by fiDPD correctly predicted all the π-stacking interactions in- volving Trp45, His49, Tyr114, and His117, indicating that π-π interactions play a critically important role in ligand binding. The prediction also found significant π-stacking interactions on pivotal sites of Leu78 and Lys123; however, these π-π interaction predictions Table 1 The prediction of protein-ligand interactions on PFSs of were ignored in posttreatment simply because of the T0737† lack of aromatic side chains in these residues. fiDPD Target Site AA COV COO ELE HBD HBA π-π also found the correct electrostatic interactions on T0737 41 G 0 0 0 0 0 0 His117 and Lys123 sites. The algorithm identified a 42 T 0 0 +/0 T 0 0 large probability of electrostatic interactions on sites 45 W 0 0 0 T 0 T Thr42 and Val126; however, these interactions were ig- 49 H 0 0 0 0 + T nored in posttreatment since the involved residues are 78 L 0 0 0 0 0 0 not chargeable in the conventional conditions. In total, approximately 80% of the overall PLI predictions were 83 C 0 0 0 + T 0 associated with identified functional sites. 114 Y 0 0 0 0 T T 117 H 0 0 T + – T 118 N 0 0 0 + T 0 CASP10 and CASP11 targets 120 V 0 0 0 0 0 0 We applied fiDPD to protein targets listed in CASP10 121 N 0 0 0 0 + 0 and CASP11, of which 13 targets had been solved with explicit bound ligands [48]. Table 2 lists all the predic- 123 K 0 0 T T + +/0 tions, of which fiDPD gave a no-hit for 3 target proteins. †AA stands for amino acid, COV for covalent bond, COO for coordinate bond, ELE for electrostatic interaction, HBD for H-bond donor, HBA for H-bond For the remaining 10 predictions, fiDPD gave an overall acceptor, π-π for π-stacking interactions. “0” indicates the corresponding precision of 64% and an overall recall of 46% using a interaction is not present in protein-ligand complex structure and fiDPD calculation also showed no such type PLIs on the site scale selection with T of 45% and M of 35%. The Han et al. BMC Bioinformatics (2018) 19:204 Page 8 of 12 Table 2 Ligand-binding sites predictions of CASP10/11 targets proteins† Target PDB Ligand Type Sites* Prediction TP Precision Recall MCC T0652 4HG0 AMP Non-metal 11 17 6 0.35 0.55 0.41 T0657 2LUL ZN Metal 5 9 4 0.44 0.8 0.58 T0659 4ESN ZN Metal 3 No-hit T0675 2LV2 ZN Metal 8 9 8 0.89 1 0.94 T0686 4HQL MG Metal 5 6 3 0.5 0.6 0.54 T0696 4RT5 NA Metal 6 3 1 0.33 0.17 0.21 T0697 4RIT TRS Non-metal 6 11 0 0 0 0 T0706 4RCK MG Metal 5 3 3 1 0.6 0.77 T0720 4IC1 MN/SF4 Metal 14 No-hit T0721 4FK1 FAD Non-metal 29 3 3 1 0.1 0.31 T0726 4FGM ZN Metal 7 No-hit T0737 3TD7 FAD Non-metal 21 13 12 0.92 0.57 0.71 T0744 2YMV FNR Non-metal 19 4 4 1 0.21 0.45 † Target 762 to 854 were taken from CASP11 whose protein-ligand interactions were well characterized in the crystal structures *“Sites” is the number of ligand-binding sites recorded in PDB files of the target protein averaged MCC of the predictions was 0.49. Considering MCC of 0.57 for the studied target proteins. Six LIBRA the ligand-binding types, we found that fiDPD provided predictions were based on the known sites of the PDB better functional site predictions for metal binding sites structures of the target proteins themselves and contrib- with an average MCC value of 0.68, while it was 0.38 for uted a higher average MCC value of 0.80. For COACH, nonmetal binding site prediction, indicating that PFSs whose prediction is sequence based, the average MCC are more conservative with respect to either spatial ar- was 0.58, of which 2 predictions were based on the rangement or sequence location in metal binding. known sites of the target PDB structures. We observed We compared the performance of fiDPD with the re- that, except for T0675 and T0697, COACH had already cently published ligand-binding site prediction methods used the target PDB structures as templates in building LIBRA [74] (Table 3) and COACH [75, 76] (Table 4). structures from input target protein sequences. Taken LIBRA aligns the structures of input proteins with a col- together, COACH performed best, while fiDPD’s per- lection of known functional sites and gives an averaged formance (the present version of the database fiDPD Table 3 Prediction performance of LIBRA* Target PDB Length Sites LIBRA Rank-1 LIBRA Rank-2 Prediction TP Model MCC Prediction TP Model MCC T0652 4HG0 292 11 7 1 N 0.08 8 7 N 0.74 T0657 2LUL 154 5 4 4 Y 0.89 4 0 N 0 T0659 4ESN 72 3 3 3 Y 1 3 0 N 0 T0675 2LV2 74 8 4 4 Y 0.69 4 4 N 0.69 T0686 4HQL 242 5 3 3 Y 0.77 3 3 Y 0.77 T0696 4RT5 111 6 7 0 N 0 5 0 N 0 T0697 4RIT 483 6 14 0 N 0 5 0 N 0 T0706 4RCK 217 5 3 0 N 0 8 1 N 0.14 T0720 4IC1 202 8 4 4 Y 0.7 5 0 N 0 T0721 4FK1 301 29 24 23 N 0.86 23 2 N 0.01 T0726 4FGM 589 7 6 6 N 0.92 10 0 N 0 T0737 3TD7 292 21 10 10 N 0.67 6 0 N 0 T0744 2YMV 329 19 12 12 Y 0.78 2 2 Y 0.64 *LIBRA prediction was based on the input of the PDBs of the target proteins. “Sites” is the number of ligand-binding sites recorded in PDB files of the target protein. “Y” in “Model” indicates that the prediction was made based on binding pockets in the PDB of the target protein as the template. “N” when the PDB of the target protein was not used in prediction Han et al. BMC Bioinformatics (2018) 19:204 Page 9 of 12 Table 4 Prediction performance of COACH* Target PDB Length Sites COACH Rank-1 COACH Rank-2 Prediction TP Model MCC Prediction TP Model MCC T0652 4HG0 292 11 12 2 N 0.14 19 2 N 0.09 T0657 2LUL 154 5 7 0 N 0 5 5 Y 1 T0659 4ESN 72 3 3 3 N 1 8 0 N 0 T0675 2LV2 74 8 4 3 N 0.49 4 4 N 0.69 T0686 4HQL 242 5 4 3 N 0.66 13 0 N 0 T0696 4RT5 111 6 5 4 N 0.72 3 1 N 0.2 T0697 4RIT 483 6 12 0 N 0 5 0 N 0 T0706 4RCK 217 5 3 3 N 0.77 5 4 N 0.79 T0720 4IC1 202 8 5 4 Y 0.62 8 4 Y 0.48 T0721 4FK1 301 29 32 24 N 0.76 19 2 N 0.01 T0726 4FGM 589 7 10 6 N 0.71 10 3 N 0.35 T0737 3TD7 292 21 21 15 N 0.69 6 1 Y 0.05 T0744 2YMV 329 19 19 18 Y 0.94 7 4 N 0.32 *COACH built structures from the sequences of target proteins except for T0675 and T0697 by directly using the PDBs of the corresponding target proteins themselves. “Sites” is the number of ligand-binding sites recorded in PDB files of the target protein. “Y” in “Model” indicates that the prediction was made based on binding pockets in the PDB of the target protein as the template. “N” when the PDB of the target protein was not used in prediction does not contain target proteins except for T0675) was predicted PLIs and those calculated based on solved comparable with that of LIBRA, especially when known protein-ligand complex structures. Table 5 compared the sites of the target PDB structures were not used. predicted PLIs on functional sites with the experimental One of the key aspects of fiDPD predictions lies in the PLIs. In most cases, fiDPD can correctly identify 80% or identification of physicochemical interactions between more of the PLIs on functional sites. predicted binding sites and ligands. We examined the performance of the fiDPD prediction of PLIs in these Conclusions target proteins by determining the overlap between the In this paper, we present a new functional site- and physicochemical interaction-annotated domain profile database (fiDPD), from which we developed a sequence- Table 5 PLI predictions of CASP10/11 targets proteins† based method for predicting both PFSs and PLIs. Our Target Interactions Correct Prediction Recall method is based on the assumption that proteins that share T0652 60 36 60% similar structure and sequence tend to have similar func- T0657 24 23 95.80% tional sites located on the same positions on a protein’ssur- T0675 30 28 93.30% face. A profile module entry in fiDPD is representative of a bunch of annotated domain structures that share high se- T0686 18 17 94.40% quence and structure similarity. The fiDPD method first T0696 18 15 83.30% identifies profile modules in the database and then, as a T0697 104 72 69.20% prediction, maps the annotated pivotal sites and associated T0706 24 21 87.50% interactions of the module(s) to the residues of the query T0720 78 58 74.40% protein. T0721 60 50 83.30% In a previous study, we examined the fDPD method with a collection of catalytic sites from a standard dataset T0737 72 63 87.50% of the 140-enzyme CATRES-FAM [68] and found that the T0744 42 37 88.10% method provided an enzyme active-site prediction of 59% T0762 42 35 83.30% recall at a precision of 18.3%. For ligand-binding site pre- T0764 60 52 86.70% diction of target proteins in CASP9, the method obtained T0770 18 14 77.80% an averaged MCC of 0.56, ranking between 8th and 10th T0784 18 18 100% of the 33 participating groups [72]. In this study, fiDPD gives new prediction for physicochemical interactions T0854 24 20 83.30% associated with the predicted PFSs. Here, fiDPD was † Target 762 to 854 were taken from CASP11 whose protein-ligand interactions were well characterized in the crystal structures applied to predict the functional sites of 10 target Han et al. BMC Bioinformatics (2018) 19:204 Page 10 of 12 proteins in CASP10 and CASP11 that have been solved designed the webserver. DM wrote the paper. All authors read and approved the final manuscript. in a ligand-bound state and achieved an averaged MCC of 0.66. When compared with the solved 3D complex Ethics approval and consent to participate structures, we found that the predicted PLIs correctly Not applicable. overlapped 80% of the true PLIs. Our calculations indi- cate that the PLIs are well-conserved biochemical prop- Competing interests erties during protein evolution and that it is possible to The authors declare that they have no competing interests. assign accurate PLIs to predicted PFSs using an anno- tated database. fiDPD demonstrates that atomic physi- Publisher’sNote cochemical interactions between proteins and ligands Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. can be reliably identified from protein sequences. fiDPD is improvable. First, new annotations could be Author details assigned to fiDPD to add new types of predictions. For Department of Physiology and Biophysics, School of Life Science, Fudan University, Shanghai 200438, People’s Republic of China. College of example, adding annotations of enzyme catalytic sites Biotechnology and Pharmaceutical Engineering, Nanjing Tech University, (CSA), ligand-specific models, such as zinc-binding Biotech Building Room B1-404, 30 South Puzhu Road, Jiangsu 211816 sites or RNA-binding sites, should endow fiDPD with Nanjing, People’s Republic of China. the corresponding capability to predict catalytic sites, Received: 19 July 2017 Accepted: 15 May 2018 zinc-binding sites or RNA-binding sites. Annotations of fiDPD modules using other resources, such as dynamic simulations, FDPA calculations [32], pocket druggability References [77], drug-target interactions (DTIs), drug modes of action 1. Konc J, Janezic D. Binding site comparison for function prediction and pharmaceutical discovery. Curr Opin Struct Biol. 2014;25:34–9. [78], etc., should provide new content for fiDPD predic- 2. Perot S, Sperandio O, Miteva MA, Camproux AC, Villoutreix BO. Druggable tions that involve the protein dynamics and drug activity pockets and binding site centric chemical space: a paradigm shift in drug in PLIs. Second, considering that the classification of discovery. Drug Discov Today. 2010;15(15–16):656–67. 3. Xie L, Xie L, Bourne PE. Structure-based systems biology for analyzing off- binding sites plays a key role in drug discovery and design, target binding. Curr Opin Struct Biol. 2011;21(2):189–99. it would be interesting to use the clustering sites [79, 80] 4. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, instead of the intact SITE information to annotate the Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–42. 5. Dukka BK. Structure-based methods for computational protein functional database, which might make the prediction more useful. site prediction. Computational and structural biotechnology journal. As a knowledge-based method, the utility and efficiency of 2013;8:e201308005. fiDPD prediction suffers from the sampling limitation of 6. Capra JA, Singh M. Characterization and prediction of residues determining protein functional specificity. Bioinformatics. 2008;24(13):1473–80. annotations of known proteins. This sampling problem 7. Manning JR, Jefferson ER, Barton GJ. The contrasting properties of conservation might be partially solved with large-scale protein sequen- and correlated phylogeny in protein functional residue prediction. BMC cing efforts and worldwide structural genomics projects. Bioinformatics. 2008;9:51. 8. Wilkins A, Erdin S, Lua R, Lichtarge O. Evolutionary trace for prediction and Abbreviations redesign of protein functional sites. Methods Mol Biol. 2012;819:29–42. CASP: Critical Assessment of Structure Prediction; FDPA: Fast dynamics 9. Fischer JD, Mayer CE, Soding J. Prediction of protein functional residues from perturbation analysis; fiDPD: Function-site- and physicochemical interaction- sequence by probability density estimation. Bioinformatics. 2008;24(5):613–20. annotated domain-profile-database; HMM: Hidden Markov Model; MCC: Matthews 10. Liang S, Zhang C, Liu S, Zhou Y. Protein binding site prediction using an correlation coefficient; MSA: Multiple sequence alignment; PFS: Protein empirical scoring function. Nucleic Acids Res. 2006;34(13):3698–707. functional site; PLI: Protein-ligand interaction; RMSD: Root-mean-square- 11. Chelliah V, Taylor WR. Functional site prediction selects correct protein distance; SCOPe: Structural classification of proteins—extended models. BMC Bioinformatics. 2008;9(Suppl 1):S13. 12. Berezin C, Glaser F, Rosenberg J, Paz I, Pupko T, Fariselli P, Casadio R, Ben-Tal N. Acknowledgements ConSeq: the identification of functionally and structurally important residues in This work began when one of the author (DM) visited CNLS in Los Alamos protein sequences. Bioinformatics. 2004;20(8):1322–4. National Laboratory. DM thanks Michael Wall for helpful discussions in early 13. Fetrow JS, Skolnick J. Method for prediction of protein function from days of this work. We also appreciated professor Rupu Zhao in Nanjing Tech sequence using the sequence-to-structure-to-function paradigm with University for helpful comments. application to glutaredoxins/thioredoxins and T1 ribonucleases. J Mol Biol. 1998;281(5):949–68. Funding 14. Gherardini PF, Helmer-Citterich M. Structure-based function prediction: This work was supported, in part, by the National Key Research and Development approaches and applications. Brief Funct Genomic Proteomic. 2008;7(4):291–302. Program of China for key technology of food safety (2017YFC1600900) and by the 15. Ausiello G, Via A, Helmer-Citterich M. Query3d: a new method for high- Key University Science Research Project of Jiangsu Province (Grant No. 17KJA180005). throughput analysis of functional residues in protein structures. BMC The funding body did neither contribute to the design of the study nor to Bioinformatics. 2005;6(Suppl 4):S5. collection, analysis and interpretation of the data nor to writing of the manuscript. 16. Barker JA, Thornton JM. An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Availability of data and materials Bioinformatics. 2003;19(13):1644–9. The method is freely available and can be accessed at: http://202.119.249.49. 17. Glaser F, Morris RJ, Najmanovich RJ, Laskowski RA, Thornton JM. A method for localizing ligand binding pockets in protein structures. Proteins. Authors’ contributions 2006;62(2):479–88. DM designed the work. DM and MH wrote the code of fiDPD program. MH 18. Brady GP Jr, Stouten PF. Fast prediction and visualization of protein binding performed the computational experiments and analyze the data. YS and JQ pockets with PASS. J Comput Aided Mol Des. 2000;14(4):383–401. Han et al. BMC Bioinformatics (2018) 19:204 Page 11 of 12 19. Tong W, Williams RJ, Wei Y, Murga LF, Ko J, Ondrechen MJ. Enhanced 47. Roche DB, Brackenridge DA, McGuffin LJ. Proteins and their interacting performance in prediction of protein active sites with THEMATICS and partners: an introduction to protein-ligand binding site prediction methods. support vector machines. Protein Sci. 2008;17(2):333–41. Int J Mol Sci. 2015;16(12):29829–42. 20. Elcock AH. Prediction of functionally important residues based solely on the 48. Gallo Cassarino T, Bordoli L, Schwede T. Assessment of ligand binding site computed energetics of protein structure. J Mol Biol. 2001;312(4):885–96. predictions in CASP10. Proteins. 2014;82(Suppl 2):154–63. 21. Kahraman A, Morris RJ, Laskowski RA, Favia AD, Thornton JM. On the 49. Moult J, Fidelis K, Kryshtafovych A, Schwede T, Tramontano A. Critical diversity of physicochemical environments experienced by identical ligands assessment of methods of protein structure prediction (CASP) - progress in binding pockets of unrelated proteins. Proteins. 2010;78(5):1120–36. and new directions in round XI. Proteins. 2016; 22. Coleman RG, Burr MA, Souvaine DL, Cheng AC. An intuitive approach to 50. Fox NK, Brenner SE, Chandonia JM. SCOPe: structural classification of measuring protein surface curvature. Proteins. 2005;61(4):1068–74. proteins–extended, integrating SCOP and ASTRAL data and classification of 23. Nayal M, Honig B. On the nature of cavities on protein surfaces: application new structures. Nucleic Acids Res. 2014;42(Database issue):D304–9. to the identification of drug-binding sites. Proteins. 2006;63(4):892–906. 51. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural 24. Petrova NV, Wu CH. Prediction of catalytic residues using support vector classification of proteins database for the investigation of sequences and machine with selected protein sequence and structural properties. BMC structures. J Mol Biol. 1995;247(4):536–40. Bioinformatics. 2006;7:312. 52. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam 25. Rossi A, Marti-Renom MA, Sali A. Localization of binding sites in protein H, Valentin F, Wallace IM, Wilm A, Lopez R, et al. Clustal W and Clustal X structures by optimization of a composite scoring function. Protein Sci. version 2.0. Bioinformatics. 2007;23(21):2947–8. 2006;15(10):2366–80. 53. Edgar RC. MUSCLE: a multiple sequence alignment method with reduced 26. Sankararaman S, Sha F, Kirsch JF, Jordan MI, Sjolander K. Active site prediction time and space complexity. BMC Bioinformatics. 2004;5:113. using evolutionary and structural information. Bioinformatics. 2010;26(5):617–24. 54. Eddy SR. A new generation of homology search tools based on 27. Somarowthu S, Yang H, Hildebrand DG, Ondrechen MJ. High-performance probabilistic inference. Genome Inform. 2009;23(1):205–11. prediction of functional residues in proteins with machine learning and 55. An XB, Wu XK, Ming DM. Sequece-based functional sites prediction computed input features. Biopolymers. 2011;95(6):390–400. from a function annotated protein domain profile database. J Fudan U. 28. Roche DB, Buenavista MT, McGuffin LJ. FunFOLDQA: a quality assessment 2013;52(6):768–78. tool for protein-ligand binding site residue predictions. PLoS One. 56. Porter CT, Bartlett GJ, Thornton JM. The catalytic site atlas: a resource of 2012;7(5):e38219. catalytic sites and residues identified in enzymes using structural data. 29. Ma B, Wolfson HJ, Nussinov R. Protein functional epitopes: hot spots, Nucleic Acids Res. 2004;32(Database issue):D129–33. dynamics and combinatorial libraries. Curr Opin Struct Biol. 2001;11(3):364–9. 57. Heazlewood JL, Durek P, Hummel J, Selbig J, Weckwerth W, Walther D, 30. Yang LW, Bahar I. Coupling between catalytic site and collective dynamics: Schulze WX. PhosPhAt: a database of phosphorylation sites in Arabidopsis a requirement for mechanochemical activity of enzymes. Structure. thaliana and a plant-specific phosphorylation site predictor. Nucleic Acids 2005;13(6):893–904. Res. 2008;36(Database issue):D1015–21. 58. McDonald IK, Thornton JM. Satisfying hydrogen bonding potential in 31. Liu T, Whitten ST, Hilser VJ. Functional residues serve a dominant role in proteins. J Mol Biol. 1994;238(5):777–93. mediating the cooperativity of the protein ensemble. Proc Natl Acad Sci U S A. 2007;104(11):4347–52. 59. Honig B, Nicholls A. Classical electrostatics in biology and chemistry. 32. Ming D, Cohn JD, Wall ME. Fast dynamics perturbation analysis for Science. 1995;268(5214):1144–9. prediction of protein functional sites. BMC Struct Biol. 2008;8:5. 60. Davis ME, Mccammon JA. Electrostatics in biomolecular structure and 33. Ming D, Wall ME. Quantifying allosteric effects in proteins. Proteins. dynamics. Chem Rev. 1990;90(3):509–21. 2005;59(4):697–707. 61. Dykstra CE. Electrostatic interaction potentials in molecular-force fields. 34. Ming D, Wall ME. Interactions in native binding sites cause a large change Chem Rev. 1993;93(7):2339–53. in protein dynamics. J Mol Biol. 2006;358(1):213–23. 62. Burley SK, Petsko GA. Aromatic-aromatic interaction: a mechanism of protein structure stabilization. Science. 1985;229(4708):23–8. 35. Fukunishi Y, Nakamura H. Prediction of ligand-binding sites of proteins by molecular docking calculation for a random ligand library. Protein Sci. 63. Muller-Dethlefs K, Hobza P. Noncovalent interactions: a challenge for 2011;20(1):95–106. experiment and theory. Chem Rev. 2000;100(1):143–68. 36. Heo L, Shin WH, Lee MS, Seok C. GalaxySite: ligand-binding-site prediction by 64. Ma B, Elkayam T, Wolfson H, Nussinov R. Protein-protein interactions: using molecular docking. Nucleic Acids Res. 2014;42(Web Server issue):W210–4. structurally conserved residues distinguish between binding sites and 37. Nerukh D, Okimoto N, Suenaga A, Taiji M. Ligand diffusion on protein exposed protein surfaces. Proc Natl Acad Sci U S A. 2003;100(10):5772–7. surface observed in molecular dynamics simulation. J Phys Chem Lett. 65. MacKerell AD, Bashford D, Bellott M, Dunbrack RL, Evanseck JD, Field MJ, Fischer S, 2012;3(23):3476–9. Gao J, Guo H, Ha S, et al. All-atom empirical potential for molecular modeling 38. Chen K, Kurgan L. Investigation of atomic level patterns in protein–small and dynamics studies of proteins. J Phys Chem B. 1998;102(18):3586–616. ligand interactions. PLoS One. 2009;4(2):e4473. 66. Matsuyama T, Yamashita T, Imai H, Shichida Y. Covalent bond between 39. Durrant JD, McCammon JA. BINANA: a novel algorithm for ligand-binding ligand and receptor required for efficient activation in rhodopsin. J Biol characterization. J Mol Graph Model. 2011;29(6):888–93. Chem. 2010;285(11):8114–21. 67. Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7(10): 40. Kasahara K, Shirota M, Kinoshita K. Comprehensive classification and e1002195. diversity assessment of atomic contacts in protein-small ligand interactions. J Chem Inf Model. 2013;53(1):241–8. 68. Bartlett GJ, Porter CT, Borkakoti N, Thornton JM. Analysis of catalytic residues 41. Salentin S, Haupt VJ, Daminelli S, Schroeder M. Polypharmacology rescored: in enzyme active sites. J Mol Biol. 2002;324(1):105–21. protein-ligand interaction profiles for remote binding site similarity 69. Moult J, Fidelis K, Kryshtafovych A, Tramontano A. Critical assessment of methods assessment. Prog Biophys Mol Biol. 2014;116(2–3):174–86. of protein structure prediction (CASP)–round IX. Proteins. 2011;(79 Suppl):10:1–5. 42. Desaphy J, Raimbaud E, Ducrot P, Rognan D. Encoding protein-ligand 70. Moult J, Fidelis K, Kryshtafovych A, Schwede T, Tramontano A. Critical interaction patterns in fingerprints and graphs. J Chem Inf Model. 2013; assessment of methods of protein structure prediction (CASP)–round x. 53(3):623–37. Proteins. 2014;82(Suppl 2):1–6. 43. Wang SH, Wu YT, Kuo SC, Yu J. HotLig: a molecular surface-directed approach 71. Matthews BW. Comparison of the predicted and observed secondary to scoring protein-ligand interactions. J Chem Inf Model. 2013;53(8):2181–95. structure of T4 phage lysozyme. Biochim Biophys Acta. 1975;405(2):442–51. 44. Schreyer AM, Blundell TL: CREDO: a structural interactomics database for 72. Schmidt T, Haas J, Gallo Cassarino T, Schwede T. Assessment of ligand- drug discovery. Database (Oxford) 2013, 2013:bat049. binding residue predictions in CASP9. Proteins. 2011;79 Suppl 10:126–36. 45. Liu Z, Li Y, Han L, Li J, Liu J, Zhao Z, Nie W, Liu Y, Wang R. PDB-wide 73. Hakim M, Ezerina D, Alon A, Vonshak O, Fass D. Exploring ORFan domains in collection of binding data: current status of the PDBbind database. giant viruses: structure of mimivirus sulfhydryl oxidase R596. PLoS One. Bioinformatics. 2015;31(3):405–12. 2012;7(11):e50649. 46. Salentin S, Schreiber S, Haupt VJ, Adasme MF, SchroederM. PLIP: fully 74. Toti D, Le VH, Tortosa V, Brandi V, Polticelli F. LIBRA-WA: a web application automated protein-ligand interaction profiler. Nucleic Acids Res. for ligand binding site detection and protein function recognition. 2015;43(W1):W443–7. Bioinformatics. 2017; Han et al. BMC Bioinformatics (2018) 19:204 Page 12 of 12 75. Yang J, Roy A, Zhang Y. BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2013; 41(Database issue):D1096–103. 76. Yang J, Yan R, Roy A, Xu D, Poisson J, Zhang Y. The I-TASSER suite: protein structure and function prediction. Nat Methods. 2015;12(1):7–8. 77. Hussein HA, Borrel A, Geneix C, Petitjean M, Regad L, Camproux AC. PockDrug-server: a new web server for predicting pocket druggability on holo and apo proteins. Nucleic Acids Res. 2015;43(W1):W436–42. 78. Wang YH, Zeng JY. Predicting drug-target interactions using restricted Boltzmann machines. Bioinformatics. 2013;29(13):126–34. 79. Ivan G, Szabadka Z, Grolmusz V. A hybrid clustering of protein binding sites. FEBS J. 2010;277(6):1494–502. 80. Szabadka Z, Grolmusz V. Building a structured PDB: the RS-PDB database. Conf Proc IEEE Eng Med Biol Soc. 2006;1:5755–8. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals

Sequence-based prediction of physicochemical interactions at protein functional sites using a function-and-interaction-annotated domain profile database

Free
12 pages

Loading next page...
 
/lp/springer_journal/sequence-based-prediction-of-physicochemical-interactions-at-protein-mdL1lXq7qu
Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s).
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Algorithms
eISSN
1471-2105
D.O.I.
10.1186/s12859-018-2206-2
Publisher site
See Article on Publisher Site

Abstract

Background: Identifying protein functional sites (PFSs) and, particularly, the physicochemical interactions at these sites is critical to understanding protein functions and the biochemical reactions involved. Several knowledge-based methods have been developed for the prediction of PFSs; however, accurate methods for predicting the physicochemical interactions associated with PFSs are still lacking. Results: In this paper, we present a sequence-based method for the prediction of physicochemical interactions at PFSs. The method is based on a functional site and physicochemical interaction-annotated domain profile database, called fiDPD, which was built using protein domains found in the Protein Data Bank. This method was applied to 13 target proteins from the very recent Critical Assessment of Structure Prediction (CASP10/11), and our calculations gave a Matthews correlation coefficient (MCC) value of 0.66 for PFS prediction and an 80% recall in the prediction of the associated physicochemical interactions. Conclusions: Our results show that, in addition to the PFSs, the physical interactions at these sites are also conserved in the evolution of proteins. This work provides a valuable sequence-based tool for rational drug design and side-effect assessment. The methodisfreelyavailable andcan be accessedat http://202.119.249.49. Keywords: Physicochemical interaction prediction, Protein functional site prediction, fiDPD, Hidden Markov model, Domain profile module Background physicochemical interactions, which is indispensable in- Most proteins perform biological functions via interactions formation for understanding protein biochemical reactions. with their partners, such as small molecules or ligands, Together with PFS prediction, accurate protein-ligand DNA/RNA, and other proteins, forming instantaneous or interaction (PLI) prediction opens up a new dimension in permanent complex structures. Of particular importance is correctly annotating protein function and thus provides that only a few pivotal amino acids on a protein’s surface, valuable information for rational drug design and drug usually called protein functional sites (PFSs), play key roles side-effect assessment [1–3]. To date, 3D protein-partner in determining these interactions. Thus, understanding complex structures have been the main source of know- protein functions depends upon accurate predictions of ledge about PFSs and PLIs. In recent years, in silico PFSs. However, PFSs alone do not reveal the details of their methods have received increasing attention as an alterna- tive strategy for protein function annotation, especially in * Correspondence: dming@njtech.edu.cn predicting PFSs. The advantage of these methods stems College of Biotechnology and Pharmaceutical Engineering, Nanjing Tech from two factors: the rapid accumulation of a large number University, Biotech Building Room B1-404, 30 South Puzhu Road, Jiangsu 211816 Nanjing, People’s Republic of China of complex 3D structures in publicly accessible databases Full list of author information is available at the end of the article © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Han et al. BMC Bioinformatics (2018) 19:204 Page 2 of 12 such as the Protein Data Bank (PDB) [4]and therapid the prediction. A profile hidden Markov model of the development of computer technology and computation HMMER program was used in the prediction to search a algorithms. module member of the database for a given protein. We In the last few decades, many computational methods applied the fiDPD method to 10 target proteins of CASP10 have emerged to identify PFSs from protein structures and [48] and CASP11 [49] and found that the method has a sequences [5]. Most sequence-based methods assume that Matthews correlation coefficient (MCC) value of 0.66 for functionally important residues are conserved through PFS prediction. Additionally, the model provided a cor- evolution and can be identified as conserved sites based rect physicochemical interaction prediction for 80% of on multiple sequence alignment (MSA) within homolo- theexaminedsites.Weexpect the present method to gous protein families [6–8]. Sequence-based information be a valuable auxiliary tool for conventional bioinformatic such as secondary structure propensity and the likely and protein function annotations. solvent accessible surface area (SASA) have also been used to improve the prediction [9–12]. In addition, Methods structure-based methods that essentially determine local Figure 1 shows the flow chart used to build fiDPD. We or overall structural similarity have been developed for first introduced the fDPD as a list of representative profile PFS prediction [13–16]. Typical local structural features modules built by sorting out structure-and-sequence include large clefts on protein surfaces [17, 18], special similar protein domains in the SCOP databases [50]. spatial arrangements of catalytic residues [19–21], and Next, PFSs and atomic patterns of PLIs were derived particular patterns between surface residues [22, 23]. from known protein-ligand-complex structures in the Other prediction methods have used both structural and PDB; then, after a series of site-to-site mappings, these sequence information [24, 25] and might, when combined structures were used to annotate fDPD profile modules with artificial intelligence techniques, provide encouraging and thus to build the fiDPD. results [26–28]. Other methods based on protein dynam- ics [29–34], conventional molecular dynamics and dock- fDPD was prepared based on the subgroup classification ing simulations [35–37]havealsobeen successfulinPSF of domain entries of the SCOP database prediction. To elucidate the physicochemical interactions We started with a modified classification of protein do- between proteins and their partners, particularly those be- main structures collected in the SCOP database [50, 51]. tween protein and ligands, researchers have attempted to In SCOP, a large protein structure is often manually di- characterize these interactions as early as the emergence vided into a few smaller parts or domains according to of the first protein-ligand complex structure. However, their spatial arrangement within the protein. A recent only very recently have structural bioinformatic tools version of SCOPe 2.05 was downloaded from http:// emerged with which to systematically characterize pro- scop.berkeley.edu/references/ver=2.05, which includes tein-ligand interactions (PLIs) [38–43] due to the rapid ac- 214,547 domain entries extracted from 75,226 protein cumulation of protein complex structures. Additionally, a structures in the PDB. In SCOP, these domain structures few databases record detailed atomic interactions be- are arranged in a hierarchical 7-level system—Class (cl), tween proteins and ligands, facilitating PLI studies Fold (cf), Superfamily (sf), Family (fa), Protein Domain [44–46]. These data provide new resources for the (dm), Species (sp), and PDB code identity (px)—according large-scale characterization of physicochemical inter- to their sequence, function and structure similarity. Spe- actions between proteins and their partners and have cifically, those domains listed in a given domain entry helped improve conventional docking simulation and (dm) presumably share the same class, fold, superfamily pharmacology research. Several knowledge-based or and protein family but might differ in species and PDB ab initio methods have been developed for the prediction code entry. Theoretically, PFSs are more likely to be of PFSs; however, an accurate method for predicting conserved when they share both higher structural and the physicochemical interactions associated with PFSs sequential similarity, and this assumption forms the basis is still lacking [47]. for our algorithm of fiDPD in the prediction of PFSs In this paper, we develop a new method for predicting and PLIs. Using a profile hidden Markov model of the physical interactions occurring on functional sites based HMMER program, the MSA of all the domains within on the amino acid sequences of given proteins. This the same dm entry gives a single representative profile sequence-based method first predicts PFSs from a func- module. In this way, 12,527 representative profile mod- tional site-annotated domain profile database, or fDPD, ules were created for all the dm entries, forming the and then assigns the types of interactions most likely to basis of fDPD and fiDPD. appear at the predicted sites. In this study, we derived a In building fDPD, it is important for protein domains functional site- and interaction-annotated domain profile within the same dm entry to be structurally and sequen- database, called fiDPD, which plays the primary role in tially close to one another. However, a quick calculation Han et al. BMC Bioinformatics (2018) 19:204 Page 3 of 12 Fig. 1 Flow-chart for building the function-site- and interaction-annotated domain profile database (fiDPD) and for predicting protein function- sites and PLIs using fiDSPD reveals that the C root-mean-square-distance (RMSD) MUSCLE [53], from which a profile module was then can be as large as 12 Å for many domain structures built using the hmmbuild module of the HMMER pro- listed in the same dm entry. This result indicates that gram (http://hmmer.org/ [54]). A profile module is a se- there are many domains listed in the same dm entry of quence of hypothetical amino acids, which is, instead of SCOPe 2.05 that have quite different structures, which conventional amino acids, probably a mixture of certain makes the profile modules of fDPD less representative of amino acids according to the MSA of the subgroup. For member proteins within the dm entry. To reduce the each individual position in a profile module, we defined difference, we divided the domains within a dm entry a conservation value C according to the MSA. We into a few smaller groups or subgroups so that selected assigned the C value as 0, 1, 3, or 4 for a position being domains within the same subgroup would have mutual nonconservative, minimally conservative, conservative C -RMSD < 7 Å and a mutual sequence similarity > 10 and highly conservative, as indicated respectively by a (a score calculated by the MSA program CLUSTALW gap, “+” symbol, a lowercase letter or a capital letter in [52]). Thus, derived subgroups then replace the dm the MUSCLE alignment. We also defined an overall vol- entry as the basic unit of fDPD. fDPD contains 16,559 ume value N for a profile module as the number of pro- subgroups, which is 32% more than the original SCOP tein domains listed in the subgroup: a larger N value dm entries, with approximately 12 member structures in usually indicates that more information is available for each subgroup, on average. that subgroup and thus a greater confidence on the annotation. fDPD is composed of functional site annotated protein A scoring function S was assigned to each position in profile modules based on multiple subgroup-protein an fDPD profile module to mark its propensity of being sequence alignment a functional site. To this end, we first mapped known In fDPD, sequences of protein domains in a subgroup functional sites of member proteins within the same were extracted and aligned using the MSA program subgroup to the profile module according to the MSA Han et al. BMC Bioinformatics (2018) 19:204 Page 4 of 12 To annotate the profile modules with PLIs, atomic interaction patterns between the protein and ligand were initially determined based on their 3D protein-ligand complex structures. Specifically, the atomic 3D coordi- nates of amino acids listed in PDB SITE sections and those of ligand molecules were filtered out from the PDB files; then, a series of atomic distances (d)werecalculated between PFSs (A ) and ligands (A ). Finally, a few Site Ligand types of bonding and nonbonding interactions for each A were determined based on the pairwise distances and Site Fig. 2 Mapping known protein function sites and interactions to a the biochemical properties of involved amino acids. domain-profile module, ⊗: known PFSs of domain structures, ⊙: pivotal PFSs in a profile module with the number indicating a H-bond weight factor, *: PFSs mapped into the query protein sequence from Almost all PLIs occur in aqueous environments, where profile module pivotal sites, which, after a filtering, is reduced to two water molecules play a critical role. As a result, hydrogen points (A and B) as a final prediction output, Δ: non-conservative pivotal sites mapped into the query protein, which will be ignored bonds might be consistently established and destroyed due to the low conservation value until a certain stable protein-ligand configuration is achieved. Here, we have calculated hydrogen bonds within (see Fig. 2). Functional sites of member proteins were the protein-ligand complex using the program HBPLUS collected from the SITE sections of the corresponding [58]. The program determines H-bond donor (D) and ac- PDB file. Of the 202,705 protein domains listed in ceptor (A) atom pairs based on a nonhydrogen atom con- SCOPe, 132,725 domain structures have a total of figuration using a maximum H–A distance of 2.5 Å, a 1,878,004 functional sites annotated in PDB SITE re- maximum D–A distance of 3.9 Å, a minimum D–H–A cords. Then, for simplicity, we assigned S as the total hit angle of 90° and a minimum H–A–AA angle of 90°, where number that a profile module position received based on H is the theoretical hydrogen atom and AA is the atom of the MSA. Thus, the larger a position’s S-value, the more functional sites in the H-bond acceptor. In this way, we likely it is to be a hypothetical functional site for the defined NHBA and NHBD as the total number of H-bond profile module. In this way, the profile modules were an- acceptors and H-bond donors, respectively, associated notated with known PFSs, and we called the database with atoms in a given functional site. composed of these profile modules the function-site- annotated domain profile database, or fDPD. Previously, Electrostatic interactions alternative functional site annotations for profile modules Electrostatic force plays important roles in many PLIs were also built by using different “known” PFSs derived and might be the main driving force to initiate catalytic from FDPA calculations instead of those recorded active reactions, to guide the recognition between protein and sites in the PDB database [55]. Compared with the dm en- ligand, and so on [59–61]. However, accurately deter- tries in the original SCOP, in fDPD, PFSs should be more mining atomic charges in bio-structure is a very challen- likely to be conserved since they share both higher struc- ging task since it is highly sensitive to the surrounding tural and higher sequential similarity. environment. Here, for simplicity, we identified electro- static interactions simply by examining the charging status of contact atoms in PLIs. Specifically, we first se- fiDPD was built by attaching physicochemical interaction lected positively charged nitrogen (N) atoms of func- annotations to functional sites in fDPD profile modules tionalsitesof Arg,His,andLysand then determined Obviously, the abovementioned S-value is heavily an electrostatic interaction if there a neighboring (< 4.5 Å) dependent on the means by which the “known” PFSs oxygen atom was present in the ligand, which is not part were determined. In this work, S-values are determined by of a cyclized structure. An electrostatic interaction was using only PDB SITE information, which, in most cases, also built when a negatively charged oxygen (O) atom is composed of manually prepared ligand-binding sites. from Asp and Glu residues was found near a ligand Other types of biologically relevant functional site data, nitrogen atom. We used NELE as the total number of such as enzyme active sites [56] and phosphorylation sites electrostatic interactions involving atoms in a given func- [57], might also be used in the annotation. Here, consider- tional site. ing the importance of PLIs in determining protein func- tion, we added PLI annotations to the profile modules of π-stacking interactions fDPD to build the function-site and interaction-annotated π-Stacking interactions play a critical role in orientating domain profile database, or fiDPD. ligands inside binding pockets. We first identified the Han et al. BMC Bioinformatics (2018) 19:204 Page 5 of 12 aromatic side chains of Trp, Phe, Tyr and His of PFSs different pivotal sites of the profile module according to and carbon-dominant cyclized structures of ligands. Usu- the MSA of the studied subgroup. As a result, each ally, aromatic rings form an effective π-stacking inter- fDPD profile module was annotated with interaction action when they get close enough (4.5–7 Å) and have vectors V on hypothetical functional sites, thus forming either a parallel or perpendicular orientation [62, 63]. the fiDPD. Here, for simplicity, we defined a π-stacking interaction if we could find three or more distinct heavy-atom pairs be- fiDPD predicts both functional sites and PLIs using a tween atoms from the aromatic ring of a given functional hidden Markov model site and those from ligand carbon-ring structures. We de- fiDPD is essentially a list of profile module entries anno- fined the total number of π-stacking interactions involving tated with domain functional sites and PLIs. In fiDPD, a given functional site as NPI. two steps are required to predict the hypothetical func- tional sites and involved PLIs for a given inquiry protein: Van der Waals interaction 1) identifying profile modules in fiDPD that match the A Van der Waals interaction is formed when the distance query sequence best and 2) interpreting pivotal func- d between a nonhydrogen atom of protein functional site tional sites and associated PLIs of the matched profile and a nonhydrogen atom of ligands satisfies the following modules as a prediction of PFSs and PLIs for the query inequality: protein based on certain statistical evaluations. In the first step, fiDPD scans the query sequence d < vdWðÞ A þ vdW A þ 0:5Å; Site Ligand against all its module entries using the SCAN module of the HMMER program [67]. The scan usually gives a where vdW(A) is the Van der Walls radius of atom A couple of profile modules within an alignment E-value − 5 and no covalent bond, coordination bond, hydrogen cutoff no greater than 1 × 10 . Each alignment (indexed bond, electrostatic force or π-stacking interaction is by superscript j in Eq. (1)) is assigned a scoring function found between them. A similar definition of the Van der E as the negative logarithm of the E-value score. Due to Waals interaction was also used by Kurgan and colleagues the limited volume of known protein sequences contained in their study of protein-small ligand interaction patterns in fiDPD, there are cases in which HMMER SCAN cannot [38] and by Ma and colleagues in their study of protein- find any match for the query protein, and for these cases, protein interactions [64]. The atomic Van der Waals radii fiDPD simply gives a notice of “no-hit.” In step 2), we de- were taken from the CHARMM22 force field [65]. Each fined a scoring function F for the ith residue of the query functional site was assigned an NVDW value as the total protein as its propensity to be a functional site: number of Van der Waals interactions involving atoms of j j j j this site. F ¼ S C N E ð1Þ 0 0 i i Covalent bond and coordinate bond where the summation runs over all the alignments j and i Usually, nonbonded forces dominate interactions between stands for the position of the profile module that matches a ligand and its target protein; however, irreversible cova- the ith residue of the query protein. Residues with a high- lent bonds are also found in PLIs when a tight and steady valued F-scoring function will be predicted as hypothetical connection between the ligand and receptor is essential to functional sites. the biological function, such as in the rhodopsin system One way to determine high-F-valued sites for a query [66]. A covalent bond is formed if the distance between a protein is to simply choose a certain number (n)of nonhydrogen atom from a functional site and a nonhy- top-valued residues, called n-top selection. This method drogen atom from ligand satisfies d < RðA Þþ RðA Þ Site Ligand has been used for enzyme catalytic site prediction [55] þ0:5 Å,where R(A) is the radius of atom A. For metal-ion since experimentally determined enzyme active sites ligands, this condition also defines coordinate bonds be- have a relatively fixed number as revealed by the Catalytic tween metal ions and PFSs. Usually, in coordinate bonds, Site Atlas (CSA) dataset [56]. Another method to select the shared electrons are present in atoms with higher elec- top-valued residues uses a cutoff percentage that was tronegativity in a functional site. We denoted NCOV as the proved to be efficient in a previous ligand-binding site total number of covalent bonds involving atoms in the func- prediction study [32, 34]. In this method, we first filtered tional site and NCOO as the total number of coordinate out those low-valued noise-like residues whose F-scores bonds involving atoms in that site. were smaller than a cutoff percentage M%ofthe max- We characterized a PLI between a PFS and the ligands imum F-value F ; then, for the remaining residues, max with a 7-dimensional interaction vector V = (NCOV, the top T% were predicted as hypothetical functional NCOO, NHBA, NHBD, NPI, NELE, NVDW). The inter- sites of the query protein. Usually, this selection strategy action vectors of all member proteins were summed in tends to give a greater prediction function for larger Han et al. BMC Bioinformatics (2018) 19:204 Page 6 of 12 proteins. We used this selection strategy to predict PFSs TP  TN−FP  FN MCC ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi : in the remainder of this paper. The server is freely avail- ðÞ TP þ FP ∙ðÞ TP þ FN ∙ðÞ TN þ FP ∙ðÞ TN þ FN able and can be accessed at http://202.119.249.49.For clarity, F-scores are renormalized to a 1–100 range for The predicted PLIs were compared with those directly predicted sites. derived from 3D protein-ligand complex structures, and To predict PLIs, we defined a protein-ligand interaction precision and recall values were obtained to qualify PLI scoring-vector function I ={NCOV,NCOO,NHBA,NHBD , predictions. i i i i i NPI , NELE ,NVDW}for the ith residue of the query i i i protein following Eq. (1): Results and discussion X The mimivirus sulfhydryl oxidase R596 j j j j I ¼ N E C V ð2Þ 0 0 The 292aa mimivirus sulfhydryl oxidase R596 is target i i T0737 of CASP10, whose structure was later deter- j j j j j j where V ¼fNCOV ; NCOO ; NHBA ; NHBD ; NPI ; 0 0 0 0 0 0 i i i i i i mined at 2.21 Å (PDB entry 3TD7; see Fig. 3 [73]). The j j NELE ; NVDW g is the PLI vector for residue i in the 0 0 protein is composed of two all alpha-helix domains: the i i profile module j that matches the ith residue of the query N-terminal sulfhydryl oxidase domain (Erv domain) and sequence. For each prediction functional site, fiDPD will the C-terminal ORFan domain. The mimivirus enzyme determine an associated PLI vector according to Eq. (2), R596 has an EC number of EC1.8.3.2, catalyzing the for- which identifies the interactions involved with each pre- mation of disulfide bonds through an oxidation reaction diction site. For clarity, in the webserver, when I has a with the help of a cofactor of flavin adenine dinucleotide nonzero value from Eq. (2), it will be simply assigned as (FAD). FAD is tightly bonded to 22 residues in the cata- “1” to indicate a certain type of PLI. lytic pocket in the Erv domain [48], playing an important role in transferring electrons from a 10 Å distance shuttle Validation datasets disulfide in the flexible interdomain loop to the active-site The original fDPD was examined for PFS prediction using disulfide close to FAD in the Erv domain [73]. In the a few types of datasets, including two manually culti- prediction, fiDPD scanned the T0737 sequence against vated enzyme catalytic site datasets of the 140-enzyme thedatabaseand found4profilemoduleentries,all CATRES-FAM [68], the 94-enzyme Catalytic Site Atlas from the Apolipoprotein family with a structure of a (CSA-FAM) [56] and a 30-member small-molecular four-helical up-and-down bundle. The 4 entries include binding protein target from CSAP9 [69]. Here, we exam- an automated-match-domain profile built from 10 sequences ined fiDPD by calculating the PLIs of protein targets listed from Arabidopsis thaliana, a second automated-match- in CASP10 [70]and in CASP11 [49], whose ligand-binding domain profile built from 4 sequences from Rattus nor- complex structures had been solved. vegicus, an augmenter of liver regeneration domain profile built from 13 sequences from Rattus norvegicus, Validation method and a thiol-oxidase Erv2p domain profile built from 6 The conventional prediction precision and recall calcula- sequences from Saccharomyces cerevisiae. The scanning − 8 − 19 tions were used to evaluate the performance of our method: E-value ranges from 2 × 10 to 1 × 10 , indicating Precision = TP/(TP + FP) and Recall = TP/(TP + FN), where that the query sequence only has moderate similarity thetruepositives (TPs)are the predicted residues listed as with the annotated sequences in the database. A total functional sites in the dataset, the false positives (FPs) are of 56 annotated pivotal sites in the 4 fiDPD profile the predicted sites not listed in the dataset, and the false modules were then collected and sorted according to negatives (FNs) are the functional sites listed in the dataset their functional site scoring functions. When mapping but missed by the method. Another relevant quantity is the to the query sequence, 12 functional sites were then true negative (TN), which stands for the correctly predicted automatically identified, resulting in a 92% prediction nonbinding/nonfunctional site residues. In our calculations, precision and 57% recall. We also examined those func- the statistics did not take account of the “no-hit” predic- tional sites that fiDPD failed to identify and found that tions. The overall precision is the sum of all the TPs divided they are located in a different C-terminal domain than by the total number of predicted residues, and the overall the four-helical up-and-down bundle domain. recall is the sum of all the TPs divided by the total number To examine the PLI prediction, we first collected inter- of listed functional sites in the dataset. The precision-recall action scoring vectors associated with pivotal sites in the curve was found to be slightly dependent on the cutoff four profile modules according Eq. (2) and then compared percentage M% and T% in the selection method. The with those directly determined from the protein-ligand MCC [71] was used to assess the ligand-binding residue complex structure recorded in PDB entry 3TD7 (Table 1). predictions of the CASP10 target proteins [72]and is Figure 3 demonstrates key interactions predicted by defined as follows: Eq. (2) and those not found by the prediction. fiDPD Han et al. BMC Bioinformatics (2018) 19:204 Page 7 of 12 Fig. 3 Mapping the protein-ligand interactions predicted for the mimivirus sulfhydryl oxidase R596, target T0737, PDB code 3TD7. Dash lines represent PLIs, they are colored as following: blue for electrostatic interactions, green for π-stacking interactions, gray for van der Waals interactions, and red for interaction not found by fiDPD correctly predicted all the π-stacking interactions in- volving Trp45, His49, Tyr114, and His117, indicating that π-π interactions play a critically important role in ligand binding. The prediction also found significant π-stacking interactions on pivotal sites of Leu78 and Lys123; however, these π-π interaction predictions Table 1 The prediction of protein-ligand interactions on PFSs of were ignored in posttreatment simply because of the T0737† lack of aromatic side chains in these residues. fiDPD Target Site AA COV COO ELE HBD HBA π-π also found the correct electrostatic interactions on T0737 41 G 0 0 0 0 0 0 His117 and Lys123 sites. The algorithm identified a 42 T 0 0 +/0 T 0 0 large probability of electrostatic interactions on sites 45 W 0 0 0 T 0 T Thr42 and Val126; however, these interactions were ig- 49 H 0 0 0 0 + T nored in posttreatment since the involved residues are 78 L 0 0 0 0 0 0 not chargeable in the conventional conditions. In total, approximately 80% of the overall PLI predictions were 83 C 0 0 0 + T 0 associated with identified functional sites. 114 Y 0 0 0 0 T T 117 H 0 0 T + – T 118 N 0 0 0 + T 0 CASP10 and CASP11 targets 120 V 0 0 0 0 0 0 We applied fiDPD to protein targets listed in CASP10 121 N 0 0 0 0 + 0 and CASP11, of which 13 targets had been solved with explicit bound ligands [48]. Table 2 lists all the predic- 123 K 0 0 T T + +/0 tions, of which fiDPD gave a no-hit for 3 target proteins. †AA stands for amino acid, COV for covalent bond, COO for coordinate bond, ELE for electrostatic interaction, HBD for H-bond donor, HBA for H-bond For the remaining 10 predictions, fiDPD gave an overall acceptor, π-π for π-stacking interactions. “0” indicates the corresponding precision of 64% and an overall recall of 46% using a interaction is not present in protein-ligand complex structure and fiDPD calculation also showed no such type PLIs on the site scale selection with T of 45% and M of 35%. The Han et al. BMC Bioinformatics (2018) 19:204 Page 8 of 12 Table 2 Ligand-binding sites predictions of CASP10/11 targets proteins† Target PDB Ligand Type Sites* Prediction TP Precision Recall MCC T0652 4HG0 AMP Non-metal 11 17 6 0.35 0.55 0.41 T0657 2LUL ZN Metal 5 9 4 0.44 0.8 0.58 T0659 4ESN ZN Metal 3 No-hit T0675 2LV2 ZN Metal 8 9 8 0.89 1 0.94 T0686 4HQL MG Metal 5 6 3 0.5 0.6 0.54 T0696 4RT5 NA Metal 6 3 1 0.33 0.17 0.21 T0697 4RIT TRS Non-metal 6 11 0 0 0 0 T0706 4RCK MG Metal 5 3 3 1 0.6 0.77 T0720 4IC1 MN/SF4 Metal 14 No-hit T0721 4FK1 FAD Non-metal 29 3 3 1 0.1 0.31 T0726 4FGM ZN Metal 7 No-hit T0737 3TD7 FAD Non-metal 21 13 12 0.92 0.57 0.71 T0744 2YMV FNR Non-metal 19 4 4 1 0.21 0.45 † Target 762 to 854 were taken from CASP11 whose protein-ligand interactions were well characterized in the crystal structures *“Sites” is the number of ligand-binding sites recorded in PDB files of the target protein averaged MCC of the predictions was 0.49. Considering MCC of 0.57 for the studied target proteins. Six LIBRA the ligand-binding types, we found that fiDPD provided predictions were based on the known sites of the PDB better functional site predictions for metal binding sites structures of the target proteins themselves and contrib- with an average MCC value of 0.68, while it was 0.38 for uted a higher average MCC value of 0.80. For COACH, nonmetal binding site prediction, indicating that PFSs whose prediction is sequence based, the average MCC are more conservative with respect to either spatial ar- was 0.58, of which 2 predictions were based on the rangement or sequence location in metal binding. known sites of the target PDB structures. We observed We compared the performance of fiDPD with the re- that, except for T0675 and T0697, COACH had already cently published ligand-binding site prediction methods used the target PDB structures as templates in building LIBRA [74] (Table 3) and COACH [75, 76] (Table 4). structures from input target protein sequences. Taken LIBRA aligns the structures of input proteins with a col- together, COACH performed best, while fiDPD’s per- lection of known functional sites and gives an averaged formance (the present version of the database fiDPD Table 3 Prediction performance of LIBRA* Target PDB Length Sites LIBRA Rank-1 LIBRA Rank-2 Prediction TP Model MCC Prediction TP Model MCC T0652 4HG0 292 11 7 1 N 0.08 8 7 N 0.74 T0657 2LUL 154 5 4 4 Y 0.89 4 0 N 0 T0659 4ESN 72 3 3 3 Y 1 3 0 N 0 T0675 2LV2 74 8 4 4 Y 0.69 4 4 N 0.69 T0686 4HQL 242 5 3 3 Y 0.77 3 3 Y 0.77 T0696 4RT5 111 6 7 0 N 0 5 0 N 0 T0697 4RIT 483 6 14 0 N 0 5 0 N 0 T0706 4RCK 217 5 3 0 N 0 8 1 N 0.14 T0720 4IC1 202 8 4 4 Y 0.7 5 0 N 0 T0721 4FK1 301 29 24 23 N 0.86 23 2 N 0.01 T0726 4FGM 589 7 6 6 N 0.92 10 0 N 0 T0737 3TD7 292 21 10 10 N 0.67 6 0 N 0 T0744 2YMV 329 19 12 12 Y 0.78 2 2 Y 0.64 *LIBRA prediction was based on the input of the PDBs of the target proteins. “Sites” is the number of ligand-binding sites recorded in PDB files of the target protein. “Y” in “Model” indicates that the prediction was made based on binding pockets in the PDB of the target protein as the template. “N” when the PDB of the target protein was not used in prediction Han et al. BMC Bioinformatics (2018) 19:204 Page 9 of 12 Table 4 Prediction performance of COACH* Target PDB Length Sites COACH Rank-1 COACH Rank-2 Prediction TP Model MCC Prediction TP Model MCC T0652 4HG0 292 11 12 2 N 0.14 19 2 N 0.09 T0657 2LUL 154 5 7 0 N 0 5 5 Y 1 T0659 4ESN 72 3 3 3 N 1 8 0 N 0 T0675 2LV2 74 8 4 3 N 0.49 4 4 N 0.69 T0686 4HQL 242 5 4 3 N 0.66 13 0 N 0 T0696 4RT5 111 6 5 4 N 0.72 3 1 N 0.2 T0697 4RIT 483 6 12 0 N 0 5 0 N 0 T0706 4RCK 217 5 3 3 N 0.77 5 4 N 0.79 T0720 4IC1 202 8 5 4 Y 0.62 8 4 Y 0.48 T0721 4FK1 301 29 32 24 N 0.76 19 2 N 0.01 T0726 4FGM 589 7 10 6 N 0.71 10 3 N 0.35 T0737 3TD7 292 21 21 15 N 0.69 6 1 Y 0.05 T0744 2YMV 329 19 19 18 Y 0.94 7 4 N 0.32 *COACH built structures from the sequences of target proteins except for T0675 and T0697 by directly using the PDBs of the corresponding target proteins themselves. “Sites” is the number of ligand-binding sites recorded in PDB files of the target protein. “Y” in “Model” indicates that the prediction was made based on binding pockets in the PDB of the target protein as the template. “N” when the PDB of the target protein was not used in prediction does not contain target proteins except for T0675) was predicted PLIs and those calculated based on solved comparable with that of LIBRA, especially when known protein-ligand complex structures. Table 5 compared the sites of the target PDB structures were not used. predicted PLIs on functional sites with the experimental One of the key aspects of fiDPD predictions lies in the PLIs. In most cases, fiDPD can correctly identify 80% or identification of physicochemical interactions between more of the PLIs on functional sites. predicted binding sites and ligands. We examined the performance of the fiDPD prediction of PLIs in these Conclusions target proteins by determining the overlap between the In this paper, we present a new functional site- and physicochemical interaction-annotated domain profile database (fiDPD), from which we developed a sequence- Table 5 PLI predictions of CASP10/11 targets proteins† based method for predicting both PFSs and PLIs. Our Target Interactions Correct Prediction Recall method is based on the assumption that proteins that share T0652 60 36 60% similar structure and sequence tend to have similar func- T0657 24 23 95.80% tional sites located on the same positions on a protein’ssur- T0675 30 28 93.30% face. A profile module entry in fiDPD is representative of a bunch of annotated domain structures that share high se- T0686 18 17 94.40% quence and structure similarity. The fiDPD method first T0696 18 15 83.30% identifies profile modules in the database and then, as a T0697 104 72 69.20% prediction, maps the annotated pivotal sites and associated T0706 24 21 87.50% interactions of the module(s) to the residues of the query T0720 78 58 74.40% protein. T0721 60 50 83.30% In a previous study, we examined the fDPD method with a collection of catalytic sites from a standard dataset T0737 72 63 87.50% of the 140-enzyme CATRES-FAM [68] and found that the T0744 42 37 88.10% method provided an enzyme active-site prediction of 59% T0762 42 35 83.30% recall at a precision of 18.3%. For ligand-binding site pre- T0764 60 52 86.70% diction of target proteins in CASP9, the method obtained T0770 18 14 77.80% an averaged MCC of 0.56, ranking between 8th and 10th T0784 18 18 100% of the 33 participating groups [72]. In this study, fiDPD gives new prediction for physicochemical interactions T0854 24 20 83.30% associated with the predicted PFSs. Here, fiDPD was † Target 762 to 854 were taken from CASP11 whose protein-ligand interactions were well characterized in the crystal structures applied to predict the functional sites of 10 target Han et al. BMC Bioinformatics (2018) 19:204 Page 10 of 12 proteins in CASP10 and CASP11 that have been solved designed the webserver. DM wrote the paper. All authors read and approved the final manuscript. in a ligand-bound state and achieved an averaged MCC of 0.66. When compared with the solved 3D complex Ethics approval and consent to participate structures, we found that the predicted PLIs correctly Not applicable. overlapped 80% of the true PLIs. Our calculations indi- cate that the PLIs are well-conserved biochemical prop- Competing interests erties during protein evolution and that it is possible to The authors declare that they have no competing interests. assign accurate PLIs to predicted PFSs using an anno- tated database. fiDPD demonstrates that atomic physi- Publisher’sNote cochemical interactions between proteins and ligands Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. can be reliably identified from protein sequences. fiDPD is improvable. First, new annotations could be Author details assigned to fiDPD to add new types of predictions. For Department of Physiology and Biophysics, School of Life Science, Fudan University, Shanghai 200438, People’s Republic of China. College of example, adding annotations of enzyme catalytic sites Biotechnology and Pharmaceutical Engineering, Nanjing Tech University, (CSA), ligand-specific models, such as zinc-binding Biotech Building Room B1-404, 30 South Puzhu Road, Jiangsu 211816 sites or RNA-binding sites, should endow fiDPD with Nanjing, People’s Republic of China. the corresponding capability to predict catalytic sites, Received: 19 July 2017 Accepted: 15 May 2018 zinc-binding sites or RNA-binding sites. Annotations of fiDPD modules using other resources, such as dynamic simulations, FDPA calculations [32], pocket druggability References [77], drug-target interactions (DTIs), drug modes of action 1. Konc J, Janezic D. Binding site comparison for function prediction and pharmaceutical discovery. Curr Opin Struct Biol. 2014;25:34–9. [78], etc., should provide new content for fiDPD predic- 2. Perot S, Sperandio O, Miteva MA, Camproux AC, Villoutreix BO. Druggable tions that involve the protein dynamics and drug activity pockets and binding site centric chemical space: a paradigm shift in drug in PLIs. Second, considering that the classification of discovery. Drug Discov Today. 2010;15(15–16):656–67. 3. Xie L, Xie L, Bourne PE. Structure-based systems biology for analyzing off- binding sites plays a key role in drug discovery and design, target binding. Curr Opin Struct Biol. 2011;21(2):189–99. it would be interesting to use the clustering sites [79, 80] 4. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, instead of the intact SITE information to annotate the Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–42. 5. Dukka BK. Structure-based methods for computational protein functional database, which might make the prediction more useful. site prediction. Computational and structural biotechnology journal. As a knowledge-based method, the utility and efficiency of 2013;8:e201308005. fiDPD prediction suffers from the sampling limitation of 6. Capra JA, Singh M. Characterization and prediction of residues determining protein functional specificity. Bioinformatics. 2008;24(13):1473–80. annotations of known proteins. This sampling problem 7. Manning JR, Jefferson ER, Barton GJ. The contrasting properties of conservation might be partially solved with large-scale protein sequen- and correlated phylogeny in protein functional residue prediction. BMC cing efforts and worldwide structural genomics projects. Bioinformatics. 2008;9:51. 8. Wilkins A, Erdin S, Lua R, Lichtarge O. Evolutionary trace for prediction and Abbreviations redesign of protein functional sites. Methods Mol Biol. 2012;819:29–42. CASP: Critical Assessment of Structure Prediction; FDPA: Fast dynamics 9. Fischer JD, Mayer CE, Soding J. Prediction of protein functional residues from perturbation analysis; fiDPD: Function-site- and physicochemical interaction- sequence by probability density estimation. Bioinformatics. 2008;24(5):613–20. annotated domain-profile-database; HMM: Hidden Markov Model; MCC: Matthews 10. Liang S, Zhang C, Liu S, Zhou Y. Protein binding site prediction using an correlation coefficient; MSA: Multiple sequence alignment; PFS: Protein empirical scoring function. Nucleic Acids Res. 2006;34(13):3698–707. functional site; PLI: Protein-ligand interaction; RMSD: Root-mean-square- 11. Chelliah V, Taylor WR. Functional site prediction selects correct protein distance; SCOPe: Structural classification of proteins—extended models. BMC Bioinformatics. 2008;9(Suppl 1):S13. 12. Berezin C, Glaser F, Rosenberg J, Paz I, Pupko T, Fariselli P, Casadio R, Ben-Tal N. Acknowledgements ConSeq: the identification of functionally and structurally important residues in This work began when one of the author (DM) visited CNLS in Los Alamos protein sequences. Bioinformatics. 2004;20(8):1322–4. National Laboratory. DM thanks Michael Wall for helpful discussions in early 13. Fetrow JS, Skolnick J. Method for prediction of protein function from days of this work. We also appreciated professor Rupu Zhao in Nanjing Tech sequence using the sequence-to-structure-to-function paradigm with University for helpful comments. application to glutaredoxins/thioredoxins and T1 ribonucleases. J Mol Biol. 1998;281(5):949–68. Funding 14. Gherardini PF, Helmer-Citterich M. Structure-based function prediction: This work was supported, in part, by the National Key Research and Development approaches and applications. Brief Funct Genomic Proteomic. 2008;7(4):291–302. Program of China for key technology of food safety (2017YFC1600900) and by the 15. Ausiello G, Via A, Helmer-Citterich M. Query3d: a new method for high- Key University Science Research Project of Jiangsu Province (Grant No. 17KJA180005). throughput analysis of functional residues in protein structures. BMC The funding body did neither contribute to the design of the study nor to Bioinformatics. 2005;6(Suppl 4):S5. collection, analysis and interpretation of the data nor to writing of the manuscript. 16. Barker JA, Thornton JM. An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Availability of data and materials Bioinformatics. 2003;19(13):1644–9. The method is freely available and can be accessed at: http://202.119.249.49. 17. Glaser F, Morris RJ, Najmanovich RJ, Laskowski RA, Thornton JM. A method for localizing ligand binding pockets in protein structures. Proteins. Authors’ contributions 2006;62(2):479–88. DM designed the work. DM and MH wrote the code of fiDPD program. MH 18. Brady GP Jr, Stouten PF. Fast prediction and visualization of protein binding performed the computational experiments and analyze the data. YS and JQ pockets with PASS. J Comput Aided Mol Des. 2000;14(4):383–401. Han et al. BMC Bioinformatics (2018) 19:204 Page 11 of 12 19. Tong W, Williams RJ, Wei Y, Murga LF, Ko J, Ondrechen MJ. Enhanced 47. Roche DB, Brackenridge DA, McGuffin LJ. Proteins and their interacting performance in prediction of protein active sites with THEMATICS and partners: an introduction to protein-ligand binding site prediction methods. support vector machines. Protein Sci. 2008;17(2):333–41. Int J Mol Sci. 2015;16(12):29829–42. 20. Elcock AH. Prediction of functionally important residues based solely on the 48. Gallo Cassarino T, Bordoli L, Schwede T. Assessment of ligand binding site computed energetics of protein structure. J Mol Biol. 2001;312(4):885–96. predictions in CASP10. Proteins. 2014;82(Suppl 2):154–63. 21. Kahraman A, Morris RJ, Laskowski RA, Favia AD, Thornton JM. On the 49. Moult J, Fidelis K, Kryshtafovych A, Schwede T, Tramontano A. Critical diversity of physicochemical environments experienced by identical ligands assessment of methods of protein structure prediction (CASP) - progress in binding pockets of unrelated proteins. Proteins. 2010;78(5):1120–36. and new directions in round XI. Proteins. 2016; 22. Coleman RG, Burr MA, Souvaine DL, Cheng AC. An intuitive approach to 50. Fox NK, Brenner SE, Chandonia JM. SCOPe: structural classification of measuring protein surface curvature. Proteins. 2005;61(4):1068–74. proteins–extended, integrating SCOP and ASTRAL data and classification of 23. Nayal M, Honig B. On the nature of cavities on protein surfaces: application new structures. Nucleic Acids Res. 2014;42(Database issue):D304–9. to the identification of drug-binding sites. Proteins. 2006;63(4):892–906. 51. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural 24. Petrova NV, Wu CH. Prediction of catalytic residues using support vector classification of proteins database for the investigation of sequences and machine with selected protein sequence and structural properties. BMC structures. J Mol Biol. 1995;247(4):536–40. Bioinformatics. 2006;7:312. 52. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam 25. Rossi A, Marti-Renom MA, Sali A. Localization of binding sites in protein H, Valentin F, Wallace IM, Wilm A, Lopez R, et al. Clustal W and Clustal X structures by optimization of a composite scoring function. Protein Sci. version 2.0. Bioinformatics. 2007;23(21):2947–8. 2006;15(10):2366–80. 53. Edgar RC. MUSCLE: a multiple sequence alignment method with reduced 26. Sankararaman S, Sha F, Kirsch JF, Jordan MI, Sjolander K. Active site prediction time and space complexity. BMC Bioinformatics. 2004;5:113. using evolutionary and structural information. Bioinformatics. 2010;26(5):617–24. 54. Eddy SR. A new generation of homology search tools based on 27. Somarowthu S, Yang H, Hildebrand DG, Ondrechen MJ. High-performance probabilistic inference. Genome Inform. 2009;23(1):205–11. prediction of functional residues in proteins with machine learning and 55. An XB, Wu XK, Ming DM. Sequece-based functional sites prediction computed input features. Biopolymers. 2011;95(6):390–400. from a function annotated protein domain profile database. J Fudan U. 28. Roche DB, Buenavista MT, McGuffin LJ. FunFOLDQA: a quality assessment 2013;52(6):768–78. tool for protein-ligand binding site residue predictions. PLoS One. 56. Porter CT, Bartlett GJ, Thornton JM. The catalytic site atlas: a resource of 2012;7(5):e38219. catalytic sites and residues identified in enzymes using structural data. 29. Ma B, Wolfson HJ, Nussinov R. Protein functional epitopes: hot spots, Nucleic Acids Res. 2004;32(Database issue):D129–33. dynamics and combinatorial libraries. Curr Opin Struct Biol. 2001;11(3):364–9. 57. Heazlewood JL, Durek P, Hummel J, Selbig J, Weckwerth W, Walther D, 30. Yang LW, Bahar I. Coupling between catalytic site and collective dynamics: Schulze WX. PhosPhAt: a database of phosphorylation sites in Arabidopsis a requirement for mechanochemical activity of enzymes. Structure. thaliana and a plant-specific phosphorylation site predictor. Nucleic Acids 2005;13(6):893–904. Res. 2008;36(Database issue):D1015–21. 58. McDonald IK, Thornton JM. Satisfying hydrogen bonding potential in 31. Liu T, Whitten ST, Hilser VJ. Functional residues serve a dominant role in proteins. J Mol Biol. 1994;238(5):777–93. mediating the cooperativity of the protein ensemble. Proc Natl Acad Sci U S A. 2007;104(11):4347–52. 59. Honig B, Nicholls A. Classical electrostatics in biology and chemistry. 32. Ming D, Cohn JD, Wall ME. Fast dynamics perturbation analysis for Science. 1995;268(5214):1144–9. prediction of protein functional sites. BMC Struct Biol. 2008;8:5. 60. Davis ME, Mccammon JA. Electrostatics in biomolecular structure and 33. Ming D, Wall ME. Quantifying allosteric effects in proteins. Proteins. dynamics. Chem Rev. 1990;90(3):509–21. 2005;59(4):697–707. 61. Dykstra CE. Electrostatic interaction potentials in molecular-force fields. 34. Ming D, Wall ME. Interactions in native binding sites cause a large change Chem Rev. 1993;93(7):2339–53. in protein dynamics. J Mol Biol. 2006;358(1):213–23. 62. Burley SK, Petsko GA. Aromatic-aromatic interaction: a mechanism of protein structure stabilization. Science. 1985;229(4708):23–8. 35. Fukunishi Y, Nakamura H. Prediction of ligand-binding sites of proteins by molecular docking calculation for a random ligand library. Protein Sci. 63. Muller-Dethlefs K, Hobza P. Noncovalent interactions: a challenge for 2011;20(1):95–106. experiment and theory. Chem Rev. 2000;100(1):143–68. 36. Heo L, Shin WH, Lee MS, Seok C. GalaxySite: ligand-binding-site prediction by 64. Ma B, Elkayam T, Wolfson H, Nussinov R. Protein-protein interactions: using molecular docking. Nucleic Acids Res. 2014;42(Web Server issue):W210–4. structurally conserved residues distinguish between binding sites and 37. Nerukh D, Okimoto N, Suenaga A, Taiji M. Ligand diffusion on protein exposed protein surfaces. Proc Natl Acad Sci U S A. 2003;100(10):5772–7. surface observed in molecular dynamics simulation. J Phys Chem Lett. 65. MacKerell AD, Bashford D, Bellott M, Dunbrack RL, Evanseck JD, Field MJ, Fischer S, 2012;3(23):3476–9. Gao J, Guo H, Ha S, et al. All-atom empirical potential for molecular modeling 38. Chen K, Kurgan L. Investigation of atomic level patterns in protein–small and dynamics studies of proteins. J Phys Chem B. 1998;102(18):3586–616. ligand interactions. PLoS One. 2009;4(2):e4473. 66. Matsuyama T, Yamashita T, Imai H, Shichida Y. Covalent bond between 39. Durrant JD, McCammon JA. BINANA: a novel algorithm for ligand-binding ligand and receptor required for efficient activation in rhodopsin. J Biol characterization. J Mol Graph Model. 2011;29(6):888–93. Chem. 2010;285(11):8114–21. 67. Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7(10): 40. Kasahara K, Shirota M, Kinoshita K. Comprehensive classification and e1002195. diversity assessment of atomic contacts in protein-small ligand interactions. J Chem Inf Model. 2013;53(1):241–8. 68. Bartlett GJ, Porter CT, Borkakoti N, Thornton JM. Analysis of catalytic residues 41. Salentin S, Haupt VJ, Daminelli S, Schroeder M. Polypharmacology rescored: in enzyme active sites. J Mol Biol. 2002;324(1):105–21. protein-ligand interaction profiles for remote binding site similarity 69. Moult J, Fidelis K, Kryshtafovych A, Tramontano A. Critical assessment of methods assessment. Prog Biophys Mol Biol. 2014;116(2–3):174–86. of protein structure prediction (CASP)–round IX. Proteins. 2011;(79 Suppl):10:1–5. 42. Desaphy J, Raimbaud E, Ducrot P, Rognan D. Encoding protein-ligand 70. Moult J, Fidelis K, Kryshtafovych A, Schwede T, Tramontano A. Critical interaction patterns in fingerprints and graphs. J Chem Inf Model. 2013; assessment of methods of protein structure prediction (CASP)–round x. 53(3):623–37. Proteins. 2014;82(Suppl 2):1–6. 43. Wang SH, Wu YT, Kuo SC, Yu J. HotLig: a molecular surface-directed approach 71. Matthews BW. Comparison of the predicted and observed secondary to scoring protein-ligand interactions. J Chem Inf Model. 2013;53(8):2181–95. structure of T4 phage lysozyme. Biochim Biophys Acta. 1975;405(2):442–51. 44. Schreyer AM, Blundell TL: CREDO: a structural interactomics database for 72. Schmidt T, Haas J, Gallo Cassarino T, Schwede T. Assessment of ligand- drug discovery. Database (Oxford) 2013, 2013:bat049. binding residue predictions in CASP9. Proteins. 2011;79 Suppl 10:126–36. 45. Liu Z, Li Y, Han L, Li J, Liu J, Zhao Z, Nie W, Liu Y, Wang R. PDB-wide 73. Hakim M, Ezerina D, Alon A, Vonshak O, Fass D. Exploring ORFan domains in collection of binding data: current status of the PDBbind database. giant viruses: structure of mimivirus sulfhydryl oxidase R596. PLoS One. Bioinformatics. 2015;31(3):405–12. 2012;7(11):e50649. 46. Salentin S, Schreiber S, Haupt VJ, Adasme MF, SchroederM. PLIP: fully 74. Toti D, Le VH, Tortosa V, Brandi V, Polticelli F. LIBRA-WA: a web application automated protein-ligand interaction profiler. Nucleic Acids Res. for ligand binding site detection and protein function recognition. 2015;43(W1):W443–7. Bioinformatics. 2017; Han et al. BMC Bioinformatics (2018) 19:204 Page 12 of 12 75. Yang J, Roy A, Zhang Y. BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2013; 41(Database issue):D1096–103. 76. Yang J, Yan R, Roy A, Xu D, Poisson J, Zhang Y. The I-TASSER suite: protein structure and function prediction. Nat Methods. 2015;12(1):7–8. 77. Hussein HA, Borrel A, Geneix C, Petitjean M, Regad L, Camproux AC. PockDrug-server: a new web server for predicting pocket druggability on holo and apo proteins. Nucleic Acids Res. 2015;43(W1):W436–42. 78. Wang YH, Zeng JY. Predicting drug-target interactions using restricted Boltzmann machines. Bioinformatics. 2013;29(13):126–34. 79. Ivan G, Szabadka Z, Grolmusz V. A hybrid clustering of protein binding sites. FEBS J. 2010;277(6):1494–502. 80. Szabadka Z, Grolmusz V. Building a structured PDB: the RS-PDB database. Conf Proc IEEE Eng Med Biol Soc. 2006;1:5755–8.

Journal

BMC BioinformaticsSpringer Journals

Published: Jun 1, 2018

References

You’re reading a free preview. Subscribe to read the entire article.


DeepDyve is your
personal research library

It’s your single place to instantly
discover and read the research
that matters to you.

Enjoy affordable access to
over 18 million articles from more than
15,000 peer-reviewed journals.

All for just $49/month

Explore the DeepDyve Library

Search

Query the DeepDyve database, plus search all of PubMed and Google Scholar seamlessly

Organize

Save any article or search result from DeepDyve, PubMed, and Google Scholar... all in one place.

Access

Get unlimited, online access to over 18 million full-text articles from more than 15,000 scientific journals.

Your journals are on DeepDyve

Read from thousands of the leading scholarly journals from SpringerNature, Elsevier, Wiley-Blackwell, Oxford University Press and more.

All the latest content is available, no embargo periods.

See the journals in your area

DeepDyve

Freelancer

DeepDyve

Pro

Price

FREE

$49/month
$360/year

Save searches from
Google Scholar,
PubMed

Create lists to
organize your research

Export lists, citations

Read DeepDyve articles

Abstract access only

Unlimited access to over
18 million full-text articles

Print

20 pages / month

PDF Discount

20% off