References
JP Mei, CK Kwoh, P Yang, XL Li, J Zheng (2013). Drug-target interaction prediction by learning from local information and neighbors. Bioinformatics, 29.
L Breiman (2001). Random forests. Mach Learn, 45.
X Xiao, JL Min, P Wang, KC Chou (2013). iGPCR-Drug: A web server for predicting interaction between GPCRs and drugs in cellular networking. PLoS ONE, 8.
G Jin, STC Wong (2014). Toward better drug repositioning: prioritizing and integrating existing methods into efficient pipelines. Drug Discov Today, 19.
D Arthur, S Vassilvitskii (2007). Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’07.
N Novac (2013). Challenges and opportunities of drug repositioning. Trends Pharmacol Sci, 34.
TT Ashburn, KB Thor (2004). Drug repositioning: identifying and developing new uses for existing drugs. Nat Rev Drug Discov, 3.
X Chen, MX Liu, GY Yan (2012). Drug-target interaction prediction by random walk on the heterogeneous network. Mol BioSyst, 8.
E Lim, A Pon, Y Djoumbou, C Knox, S Shrivastava, AC Guo, V Neveu, DS Wishart (2010). T3DB: a comprehensively annotated database of common toxins and their targets. Nucleic Acids Res, 38.
H He, EA Garcia (2009). Learning from imbalanced data. IEEE Trans Knowl Data Eng, 21.
SM Paul, DS Mytelka, CT Dunwiddie, CC Persinger, BH Munos, SR Lindborg, AL Schacht (2010). How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nat Rev Drug Discov, 9.
MJ Keiser, V Setola, JJ Irwin, C Laggner, AI Abbas, SJ Hufeisen, NH Jensen, MB Kuijer, RC Matos, TB Tran (2009). Predicting new molecular targets for known drugs. Nature, 462.
ZR Li, HH Lin, LY Han, L Jiang, X Chen, YZ Chen (2006). PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res, 34.
H Yu, J Chen, X Xu, Y Li, H Zhao, Y Fang, X Li, W Zhou, W Wang, Y Wang (2012). A systematic prediction of multiple drug-target interactions from chemical, genomic, and pharmacological data. PLoS ONE, 7.
H Li, Z Gao, L Kang, H Zhang, K Yang, K Yu, X Luo, W Zhu, K Chen, J Shen, X Wang, H Jiang (2006). TarFisDock: a web server for identifying drug targets with docking approach. Nucleic Acids Res, 34.
AM Wassermann, H Geppert, J Bajorath (2009). Ligand prediction for orphan targets using support vector machines and various target-ligand kernels is dominated by nearest neighbor effects. J Chem Inform Model, 49.
F Cheng, C Liu, J Jiang, W Lu, W Li, G Liu, W Zhou, J Huang, Y Tang (2012). Prediction of drug-target interactions and drug repositioning via network-based inference. PLoS Comput Biol, 8.
L Nanni, A Lumini, S Brahnam (2014). A set of descriptors for identifying the protein-drug interaction in cellular networking. J Theor Biol, 359.
DS Cao, S Liu, QS Xu, HM Lu, JH Huang, QN Hu, YZ Liang (2012). Large-scale prediction of drug-target interactions using protein sequences and drug topological structures. Analytica Chimica Acta, 752.
Y Yamanishi, E Pauwels, H Saigo, V Stoven (2011). Extracting sets of chemical substructures and protein domains governing drug-target interactions. J Chem Inform Model, 51.
K Bleakley, Y Yamanishi (2009). Supervised prediction of drug–target interactions using bipartite local models. Bioinformatics, 25.
DS Cao, N Xiao, QS Xu, AF Chen (2015). Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics, 31.
L Xie, T Evangelidis, L Xie, PE Bourne (2011). Drug discovery using chemical systems biology: Weak inhibition of multiple kinases may contribute to the anti-cancer effect of nelfinavir. PLoS Comput Biol, 7.
T Fawcett (2006). An introduction to ROC analysis. Pattern Recognit Lett, 27.
M Kanehisa, S Goto, Y Sato, M Furumichi, M Tanabe (2012). KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res, 40.
ZH Zhou (2012). Ensemble methods: Foundations and algorithms.
C Knox, V Law, T Jewison, P Liu, S Ly, A Frolkis, A Pon, K Banco, C Mak, V Neveu, Y Djoumbou, R Eisner, AC Guo, DS Wishart (2011). DrugBank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic Acids Res, 39.
Z Mousavian, A Masoudi-Nejad (2014). Drug-target interaction prediction via chemogenomic space: learning-based methods. Expert Opinion Drug Metab Toxicol, 10.
GM Weiss (2004). Mining with rarity: A unifying framework. SIGKDD Explor Newsl, 6.
L Jacob, JP Vert (2008). Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics, 24.
A Ezzat, P Zhao, M Wu, X Li, CK Kwoh (2016). Drug-target interaction prediction with graph regularized matrix factorization. IEEE/ACM Trans Comput Biol Bioinformatics, PP.
J Li, S Zheng, B Chen, AJ Butte, SJ Swamidass, Z Lu (2016). A survey of current trends in computational drug repositioning. Brief Bioinformatics, 17.
M Kuhn, D Szklarczyk, S Pletscher-Frankild, TH Blicher, C von Mering, LJ Jensen, P Bork (2014). STITCH 4: integration of protein chemical interactions with user data. Nucleic Acids Res, 42.
Z He, J Zhang, XH Shi, LL Hu, X Kong, YD Cai, KC Chou (2010). Predicting drug-target interaction networks based on functional groups and biological features. PLoS ONE, 5.
T van Laarhoven, SB Nabuurs, E Marchiori (2011). Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics, 27.
R Meir, G Rätsch (2003). Advanced Lectures on Machine Learning: Machine Learning Summer School 2002 Canberra, Australia, February 11–22, 2002 Revised Lectures.
M Gönen (2012). Predicting drug–target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics, 28.
Background: Multiple computational methods for predicting drug-target interactions have been developed to facilitate the drug discovery process. These methods use available data on known drug-target interactions to train classifiers with the purpose of predicting new undiscovered interactions. However, a key challenge regarding this data that has not yet been addressed by these methods, namely class imbalance, is potentially degrading the prediction performance. Class imbalance can be divided into two sub-problems. Firstly, the number of known interacting drug-target pairs is much smaller than that of non-interacting drug-target pairs. This imbalance ratio between interacting and non-interacting drug-target pairs is referred to as the between-class imbalance. Between-class imbalance degrades prediction performance due to the bias in prediction results towards the majority class (i.e. the non-interacting pairs), leading to more prediction errors in the minority class (i.e. the interacting pairs). Secondly, there are multiple types of drug-target interactions in the data with some types having relatively fewer members (or are less represented) than others. This variation in representation of the different interaction types leads to another kind of imbalance referred to as the within-class imbalance. In within-class imbalance, prediction results are biased towards the better represented interaction types, leading to more prediction errors in the less represented interaction types.

Results: We propose an ensemble learning method that incorporates techniques to address the issues of between-class imbalance and within-class imbalance. Experiments show that the proposed method improves results over 4 state-of-the-art methods. In addition, we simulated cases for new drugs and targets to see how our method would perform in predicting their interactions. New drugs and targets are those for which no prior interactions are known.
Our method displayed satisfactory prediction performance and was able to predict many of the interactions successfully.

Conclusions: Our proposed method has improved the prediction performance over the existing work, thus proving the importance of addressing problems pertaining to class imbalance in the data.

Keywords: Drug-target interaction prediction, Class imbalance, Between-class imbalance, Within-class imbalance, Small disjuncts, Ensemble learning

*Correspondence: [email protected]
Institute for Infocomm Research (I2R), A*Star, Fusionopolis Way, Singapore 138632, Singapore
Full list of author information is available at the end of the article

© The Author(s). 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

The Author(s) BMC Bioinformatics 2016, 17(Suppl 19):509

Background

On average, it takes over a dozen years and around 1.8 billion dollars to develop a drug [1]. Moreover, most of the drugs being developed fail to reach the market due to reasons pertaining to toxicity or low efficacy [2]. To mitigate the risks and costs inherent in traditional drug discovery, many pharmaceutical companies resort to drug repurposing or repositioning where drugs already on the market may be reused for novel disease treatments that differ from their original objective and purpose [3].

Intuitively, repurposing a known drug to treat new diseases is convenient and cost-effective for the following two reasons. Firstly, since the drug being repurposed is one that is already on the market (i.e. already approved by the FDA), this implicitly means that it already passed clinical trials that ensure the drug is safe to use. Secondly, the drug being repurposed has already been studied extensively, so many of the drug’s properties (e.g. interaction profile, therapeutic or side effects, etc.) are known before initiating the drug repurposing effort. As such, drug repurposing helps facilitate and accelerate the research and development process in the drug discovery pipeline [2].

Many data sources are publicly available online that support efforts in computational drug repositioning [4]. Based on the types of data being used, different methods and procedures have been proposed to achieve drug repositioning [5]. In this paper, we particularly focus on global-scale drug-target interaction prediction; that is, leveraging information on known drug-target interactions, we aim to predict or prioritize new previously unknown drug-target interactions to be further investigated and confirmed via experimental wet-lab methods later on.

The main benefit of this technique for drug repositioning efforts is that, given a protein of interest (e.g. its gene is associated with a certain disease), many FDA-approved drugs may simultaneously be computationally screened to determine good candidates for binding [6]. As previously mentioned, using an approved drug as a starting point in drug development has desirable benefits regarding cost, time and effort spent in developing the drug. In addition, other benefits of this technique include the screening of potential off-targets that may cause undesired side effects, thus facilitating the detection of potential problems early in the drug development process. Finally, new predicted targets for a drug could improve our understanding of its actions and properties [7].

Efforts involving global-scale prediction of drug-target interactions have been fueled by the availability of public online databases that store information on drugs and their interacting targets, such as KEGG [8], DrugBank [9], ChEMBL [10] and STITCH [11].

These efforts can be divided into three categories. The first category is that of ligand-based methods where the drug-target interactions are predicted based on the similarity between the target proteins’ ligands. A problem with this category of methods is that many target proteins have little or no ligand information available, which limits the applicability of these methods [12].

Docking simulation methods represent the second category of approaches for predicting drug-target interactions. Although they have been successfully used to predict drug-target interactions [13, 14], a limitation with these methods is that they require the 3D structures of the proteins, which is a problem because not all proteins have their 3D structures available. In fact, most membrane proteins (which are popular drug targets) do not have resolved 3D structures, as determining their structures is a challenging task [15].

The third category is the chemogenomic approaches which simultaneously utilize both the drug and target information to perform predictions. Chemogenomic methods come in a variety of forms. Some are kernel-based methods that make use of information encoded in both drug and target similarity matrices to perform predictions [16–21], while other chemogenomic methods use graph-based techniques, such as random walk [22] and network diffusion [23].

In this paper, we focus on a particular type of chemogenomic methods, namely feature-based methods, where drugs and targets are represented with sets of descriptors (i.e. feature vectors). For example, He et al. represented drugs and targets using common chemical functional groups and pseudo amino acid composition, respectively [24], while Yu et al. used molecular descriptors that were calculated using the DRAGON package [25] and the PROFEAT web server [26] for drugs and targets, respectively [27]. Other descriptors have also been used such as position-specific scoring matrices [28], 2D molecular fingerprints [29], MACCS fingerprints [30], and domain and PubChem fingerprints [31].

In general, many of the existing methods treat drug-target interaction prediction as a binary classification problem where the positive class consists of interacting drug-target pairs and the negative class consists of non-interacting drug-target pairs. Clearly, there exists a between-class (or inter-class) imbalance as the number of the non-interacting drug-target pairs (or majority negative class instances) far exceeds that of the interacting drug-target pairs (or minority positive class instances). This results in biasing the existing prediction methods towards classifying instances into the majority class to minimize the classification errors [32]. Unfortunately, minority class instances are the ones of interest to us. A common solution that was used in previous studies (e.g. [27]) is to perform random sampling from the majority class until the number of sampled majority class instances matches that of the minority class instances. While this considerably mitigates the bias problem, it inevitably leads to the discarding of useful information (from the majority class) whose inclusion may lead to better predictions.

The other kind of class imbalance that also degrades prediction performance, but has not been previously addressed, is the within-class (or intra-class) imbalance which takes place when rare cases are present in the data [33]. In our case, there are multiple different types of drug-target interactions in the positive class, but some of them are represented by relatively fewer members than others and can be considered as less well-represented interaction groups (also known as small concepts or small disjuncts). If not processed well, they are a source of errors because predictions would be biased towards the well-represented interaction types in the data and ignore these specific small concepts.

In this paper, we propose a simple method that addresses the two imbalance problems stated above. Firstly, we provide a solution for the high imbalance ratio between the minority and majority classes while greatly decreasing the amount of information discarded from the majority class. Secondly, our method also deals with the within-class imbalance prevalent in the data by balancing the ratios between the different concepts inside the minority class. Particularly, we first perform clustering to detect homogenous groups where each group corresponds to one specific concept and the interactions within smaller groups are relatively more likely to be incorrectly classified. As such, we artificially enhance small groups via oversampling, which essentially helps our classification model focus on these small concepts to minimize classification errors.

Data

This section provides our dataset information including raw drug-target interaction data and the data representation that turns each drug-target pair into its feature vector representation.

Drug-target interaction data

The interaction data used in this study was collected recently from the DrugBank database [9] (version 4.3, released on 17 Nov. 2015). Some statistics regarding the collected interaction data are given in Table 1. In total, there are 12674 drug-target interactions between 5877 drugs and their 3348 protein interaction partners. The full lists of drugs and targets used in this study as well as the interaction data (i.e. which drugs interact with which targets) have been included as supplementary material [see Additional files 1, 2 and 3].

Table 1 Statistics of the interaction dataset used in this study
Drugs | Targets | Interactions
5877 | 3348 | 12674

Data representation

After having obtained the interaction data, we generated features for the drugs and targets respectively. Particularly, descriptors for drugs were calculated using the Rcpi [34] package. Examples of drug features include constitutional, topological and geometrical descriptors among other molecular properties. Note that biotech drugs have been excluded from this study as Rcpi could only generate such features for small-molecule drugs. The statistics given in Table 1 reflect our final dataset after the removal of these biotech drugs.

Now, we describe how target features were obtained. Since it is generally assumed that the complete information of a target protein is encoded in its sequence [24], it may be intuitive to represent targets by their sequences. However, representing the targets this way is not suitable for machine learning algorithms because the length of the sequence varies from one protein to another. To deal with this issue, an alternative to using the raw protein sequences is to compute (from these same sequences) a number of different descriptors corresponding to various protein properties. The list of computed features is intended to be as comprehensive as possible so that it may, as much as possible, convey all the information available in the genomic sequences that they were computed from. Computing this list of features for each of the targets lets them be represented using fixed-length feature vectors that can be used as input to machine learning methods. In our work, the target features were computed from their genomic sequences with the help of the PROFEAT [26] web server.

The features that have been used to represent targets in this work are descriptors related to amino acid composition; dipeptide composition; autocorrelation; composition, transition and distribution; quasi-sequence-order; amphiphilic pseudo-amino acid composition and total amino acid properties. Note that a similar list of features was used previously in [27]. Subsets of these features have also been used in other previous studies concerning drug-target interaction prediction [24, 35]. More information regarding the computed features can be accessed at the online documentation webpage of the PROFEAT web server where all the features are described in detail.

After generating features for drugs and targets, there were features that had constant values among all drugs (or targets). Such features were removed as they would not contribute to the prediction of drug-target interactions. Furthermore, there were other features that had missing values for some of the drugs (or targets). For each of these features, the missing values were replaced by the mean of the feature over all drugs (or targets). In the end, 193 and 1290 features remained for drugs and targets, respectively. The full lists of drug features and target features used in this study have been included as supplementary material [see Additional files 4 and 5].

Next, every drug-target pair is represented by feature vectors that are formed by concatenating the feature vectors of the corresponding drug and target involved. For example, a drug-target pair (d, t) is represented by the feature vector,

\[
[d_1, d_2, \ldots, d_{193}, t_1, t_2, \ldots, t_{1290}],
\]

where $[d_1, d_2, \ldots, d_{193}]$ is the feature vector corresponding to drug d, and $[t_1, t_2, \ldots, t_{1290}]$ is the feature vector corresponding to target t. Hereafter, we also refer to these drug-target pairs as instances. Finally, to avoid potential feature bias in its original feature values, all features were normalized to the range [0, 1] using min-max normalization before performing drug-target interaction prediction as follows

\[
\forall i = 1, \ldots, 193: \quad d_i = \frac{d_i - \min(d_i)}{\max(d_i) - \min(d_i)},
\]

\[
\forall j = 1, \ldots, 1290: \quad t_j = \frac{t_j - \min(t_j)}{\max(t_j) - \min(t_j)}.
\]

The feature vectors that were computed for the drugs and targets have been included as supplementary material [see Additional files 6 and 7].

Methods

The proposed method was developed with an intention to deal with two key imbalance issues, namely the between-class imbalance and the within-class imbalance. Here, we describe in detail how each of these imbalance issues was handled. For notation, we use P to refer to the set of positive instances (i.e. the known experimentally verified drug-target interactions) and use N to refer to the remaining negative instances (consisting of all other drug-target pairs that do not occur in P).

Technically speaking, these remaining instances should be called unlabeled instances as they have not been experimentally verified to be true non-interactions. In fact, we believe that some of the instances in N are actually true drug-target interactions that have not been discovered yet. Nevertheless, to simplify our discussion, we refer to them as negative instances since we assume the proportion of non-interactions in N to be quite high.

Our proposed algorithm

We propose a simple ensemble learning method where the prediction results of the different base learners are aggregated to produce the final prediction scores. For base learners, our ensemble method uses decision trees which are popularly used in ensemble methods (e.g. random forest [36]). Decision trees are known to be unstable learners, meaning that their prediction results are easily perturbed by modifying the training set, making them a good fit with ensemble methods which make use of the diversity in their base learners to improve prediction performance [37].

It is generally known that an ensemble learning method improves prediction performance over any of its constituent base learners only if they are uncorrelated. Intuitively, if the base learners of an ensemble method were identical, then there would be no gain in prediction performance at all. As such, adding diversity to the base learners is important.

One way of introducing diversity to the base learners that is used in our method is supplying each base learner with a different training set. Another way of adding diversity that we also employ here is feature subspacing; that is, for each of the base learners, we represent the instances using a different subset of the features. More precisely, for each base learner, we randomly select two thirds of the features to represent the instances.

Algorithm 1 shows our pseudocode for the overall architecture of our proposed method where the specific steps for handling the two imbalance issues are discussed in the following subsections. Following is a summary of the method:

- T decision trees are trained (T is a parameter).
- Prediction results of the T trees are aggregated by simple averaging to give the final prediction scores.
- For each decision tree, $tree_i$:
  1. Randomly select a subset of the features, $F_i$.
  2. Obtain $P_i$ by performing feature subspacing on P using $F_i$.
  3. Oversample $P_i$.
  4. Randomly sample $N_i$ from N such that $|N_i| = |P_i|$.
  5. Remove instances of $N_i$ from N.
  6. Modify $N_i$ by performing feature subspacing on it using $F_i$.
  7. Train $tree_i$ using the positive set $P_i$ and the negative set $N_i$ as the training set.

Algorithm 1: Pseudocode of proposed method.
    Input:  P = positive instances,
            N = negative instances,
            T = number of base learners.
    Result: ensemble = trained ensemble classifier.
    begin
        for i ← 1 to T do
            F_i = randomly selected feature subset
            P_i = P(F_i)                  //feature subspacing
            //for within-class imbalance
            P_i = OVERSAMPLE(P_i)
            //for between-class imbalance
            repeat
                Randomly sample N_i ∈ N
            until |N_i| = |P_i|;
            N = N − N_i
            N_i = N_i(F_i)                //feature subspacing
            tree_i = train decision tree using P_i and N_i
        return ensemble = {tree_i}, i = 1, ..., T

Within-class imbalance

We are now ready to explain the OVERSAMPLE($P_i$) in Algorithm 1. As mentioned in the introduction section, within-class imbalance refers to the presence of specific types of interactions in the positive set P that are under-represented in the data as compared to other interaction types. Such cases are referred to as small concepts, and they are a source of errors because prediction algorithms are typically biased in that they favor the better represented interaction types in the data so as to achieve better generalization performance on unseen data [33].

To deal with this issue, we use the K-means++ clustering method [38] to cluster the data into K homogenous clusters (K is a parameter) where each cluster corresponds to one specific concept. This results in interaction groups/clusters of different sizes. The assumption here is that the small clusters (i.e. those that contain few members) correspond to the rare concepts (or small disjuncts) that we are concerned about. Supposing that the size of the biggest cluster is maxClusterSize, all clusters are re-sampled until their sizes are equal to maxClusterSize. This way, all concepts become represented by the same number of members and are consequently treated equally in training our classifier. Essentially, this is similar in spirit to the idea of boosting [39] where examples that are incorrectly classified have their weights increased so that classification methods will focus on the hard-to-classify examples to minimize the classification errors.

Algorithm 2 shows the pseudocode for the oversampling procedure. $P_i$ is first clustered into K clusters of different sizes. After determining the size of the biggest of these clusters, maxClusterSize, all clusters are re-sampled until their sizes are equal to maxClusterSize. The re-sampled clusters are then assigned to $P_i$ before returning it to the main algorithm in the “Our proposed algorithm” subsection.

Algorithm 2: Oversampling procedure.
    Input:  P_i = positive instances.
    Result: P_i = oversampled positive instances.
    begin
        Cluster P_i into K clusters: C_1 ... C_K
        maxClusterSize = max_k size(C_k)
        P_i = φ
        for j ← 1 to K do
            repeat
                Re-sample C_j
            until size(C_j) = maxClusterSize;
            P_i = P_i ∪ members(C_j)
        return P_i

An issue that we considered while implementing the oversampling procedure was that of data noise. Indeed, emphasizing small concept data can become a counterproductive strategy if there is much noise in the data. However, the data used in this study was obtained from DrugBank [9], and since the data stored there is regularly curated by experts, we have high confidence in the interactions observed in our dataset. In other words, the interactions (or positive instances) are quite reliable and are expected to contain little to no noise. On the other hand, the negative instances are expected to contain noise since, as mentioned earlier, these negative instances are actually unlabeled instances that likely contain interactions that have not been discovered yet. Here, we only amplify the importance of small-concept data from the positive set (i.e. the set of known drug-target interactions). Since the positive instances being emphasized are highly reliable, the potential impact of noise on the prediction performance is minimal.

Between-class imbalance

Between-class imbalance refers to the bias in the prediction results towards the majority class, leading to errors where minority examples are classified into the majority class. We wanted to ensure that predictions are not biased towards the majority class while, at the same time, decreasing the amount of useful majority class information being discarded. To that end, a different set of negative instances $N_i$ is randomly sampled from N for each base learner i such that $|N_i| = |P_i|$. The 1:1 ratio of the sizes of $P_i$ and $N_i$ eliminates the bias of the prediction results towards the majority class. Moreover, whenever a set of negative instances $N_i$ is formed for a base learner, its instances are excluded from consideration when we perform random sampling from N for future base learners. The different non-overlapping negative sets that are formed for the base learners lead to better coverage of the majority class in training the ensemble classifier.

Note that, to improve coverage of the majority class in training, the value of the parameter T needs to be increased where T is the number of base learners in the ensemble method, which also determines the number of times that we want to draw instances from the negative set N. In general, with the increase of the value of T, more useful information from the majority class will be incorporated to build our final classification model.

Results and discussion

In this section, we have performed comprehensive experiments in which we compare our proposed technique with 4 existing methods. Below, we first elaborate on our experimental settings. Next, we provide details of our cross-validation experiments and comparison results. Finally, we focus on predicting interactions for new drugs and new targets, which is crucial for both novel drug design and drug repositioning tasks.

Experimental settings

To evaluate our proposed method, we conducted an empirical comparison with 2 state-of-the-art methods and 2 baseline methods. Particularly, Random Forest and SVM are existing state-of-the-art methods that were both used in a recent work for predicting drug-target interactions [27]. Note that the parameters for these 2 methods were set to the default optimal values supplied in [27]. We also included two baseline methods, namely Decision Tree and Nearest Neighbor. For Decision Tree, we employed the fitctree built-in package in MATLAB and used the default parameter values as they were found to produce reasonably good results. As for Nearest Neighbor, it produces a prediction score for every test instance a by computing its similarity to the nearest neighbor b from the minority class P (which contains the known interacting drug-target pairs) based on the following equations,

\[
\mathrm{score}_a = \max_b \left( \mathrm{sim}(a, b) \right), \quad b \in P,
\]

\[
\mathrm{sim}(a, b) = \exp\left( -\frac{\|a - b\|}{|F|} \right),
\]

where |F| is the number of features.

For the above 4 competing methods, they all used P as the positive set, while the negative set was sampled randomly from N until its size reached |P|. In contrast, our method oversampled P for each base learner i, giving $P_i$, and a negative set $N_i$ was sampled from N for each base learner i such that $|N_i| = |P_i|$. Note that different base learners have used different negative sets in our proposed method. In addition, the parameters K and T for our method were set to 100 and 500, respectively, to generate sufficient homogenous clusters and leverage more negative data.

Cross validation experiments

To study the prediction performance of our proposed method, we performed a standard 5-fold cross validation and computed the AUC for each method (i.e. the area under the ROC curve). More precisely, for each of the methods being compared, 5 AUC scores were computed (one for each fold) and then averaged to give the final overall AUC score. Note that AUC is known to be insensitive to skewed class distributions [40]. Considering that the drug-target interaction dataset used in this study is highly imbalanced (we have much more negatives than positives), AUC score is thus a suitable metric for evaluation of the different computational methods.

Figure 1 shows the ROC curves for various methods. It is obvious that the ROC curve for our proposed method dominates those for the other methods, implying that it has a higher AUC score. In particular, Table 2 shows the AUC scores for different methods in detail. Our proposed method achieves an AUC of 0.900 and performs significantly better than other existing methods.

Fig. 1 Plot of ROC curves of the different methods. ROC curves for the different methods are plotted together, providing a visual comparison between their prediction performances

Table 2 AUC Results of cross validation experiments
Method | AUC
Decision Tree | 0.760 (0.004)
SVM | 0.804 (0.004)
Nearest Neighbor | 0.814 (0.003)
Random Forest | 0.855 (0.006)
Proposed Method | 0.900 (0.006)
Standard deviations are included between parentheses. Best AUC is indicated in bold

As shown in Table 2, the second best method is Random Forest. Moreover, our method is similar to Random Forest in that they are both ensembles of decision trees with feature subspacing. Both our proposed method and Random Forest perform very well in drug-target interaction prediction, showing that ensemble methods are indeed well suited to achieving good prediction performance. However, our method differs from Random Forest in two respects. Firstly, Random Forest performs bagging on a single sampled negative set for each base learner, while our method leverages multiple non-overlapping negative sets for different base learners. Secondly, our method also oversamples the positive set in a way that is intended to deal with the within-class imbalance, while Random Forest does not. Due to these 2 differences, our method achieved an AUC of 0.900, which is 0.045 (4.5 percentage points) higher than Random Forest with an AUC of 0.855. This supports our claim that dealing with class imbalance in the data is important for improving the prediction performance.

Predicting interactions for new drugs and targets

A scenario that may occur in drug discovery is that we may have a target protein of interest for which no information on interacting drugs is available. This is typically a more challenging case than if we had information on drugs that the target protein is already known to interact with. A similar scenario that occurs frequently in practice is that we have new compounds (potential drugs) for which no interactions are known yet, and we want to determine candidate target proteins that they may interact with. When there is no interaction information on a drug or target, they are referred to as a new drug or a new target.

Table 3 Top 20 targets predicted for Aripiprazole and Theophylline
Rank | Aripiprazole target | Theophylline target
1 | 5-hydroxytryptamine receptor 2A | cAMP-specific 3’,5’-cyclic phosphodiesterase 4A
2 | Alpha-1B adrenergic receptor | Histone deacetylase 2
3 | Muscarinic acetylcholine receptor M2 | Adenosine receptor A2a
4 | 5-hydroxytryptamine receptor 2C | Adenosine receptor A1
5 | D(1) dopamine receptor | cGMP-inhibited 3’,5’-cyclic phosphodiesterase A
6 | Alpha-2C adrenergic receptor | cAMP-specific 3’,5’-cyclic phosphodiesterase 4B
7 | Histamine H1 receptor | Adenosine receptor A2b
8 | Muscarinic acetylcholine receptor M3 | cGMP-specific 3’,5’-cyclic phosphodiesterase
9 | D(2) dopamine receptor | Adenosine receptor A3
10 | Muscarinic acetylcholine receptor M1 | Thymidylate synthase
11 | 5-hydroxytryptamine receptor 1B | Histone deacetylase 1
12 | Delta-type opioid receptor | Cyclin-dependent kinase 2
13 | D(4) dopamine receptor | Reverse transcriptase/RNaseH
14 | D(3) dopamine receptor | Cap-specific mRNA (nucleoside-2’-O-)-methyltransferase
15 | 5-hydroxytryptamine receptor 1D | Multi-sensor signal transduction histidine kinase
16 | Alpha-1 adrenergic receptor | Alpha-1 adrenergic receptor
17 | Muscarinic acetylcholine receptor M5 | Serine/threonine-protein kinase pim-1
18 | Muscarinic acetylcholine receptor M4 | Serine-protein kinase ATM
19 | Alpha-2B adrenergic receptor | Proto-oncogene tyrosine-protein kinase Src
20 | 5-hydroxytryptamine receptor 1A | Phosphatidylinositol 4,5-bisphosphate 3-kinase catalytic subunit alpha isoform
Targets in bold are the true known targets of the drugs

To test the ability of our method to correctly predict interactions in these challenging cases, we simulated the cases of new drugs and targets by leaving them out of our dataset, training with the rest of the data and then obtaining predictions for these new drugs and new targets. In our case studies, we ranked the predicted interactions and investigated the top 20 interactions.
have also confirmed, using the STITCH online database In particular, we investigated two drugs, Aripiprazole [11], that Adenosine receptor A3 and Histone deacety- and Theophylline, and two targets, Glutamate receptor lase 1 are true targets of Theophylline as well. These ionotropic, kainate 2 and Xylose isomerase,respectively. findings suggest that the unconfirmed interactions in Tables 3 and 4 show the top 20 predictions for these drugs Tables 3 and 4 may be true interactions that have not been and targets. discovered yet. In our dataset, Aripiprazole and Theophylline are known to interact with 25 and 8 targets, respectively. Out of the Conclusion top 20 predicted targets for Aripiprazole, 19 were cor- We proposed a simple yet effective ensemble method for rectly predicted as shown in Table 3. For Theophylline,all predicting drug-target interactions. This method includes of its 8 interactions were highly ranked in its top 20 list. techniques for dealing with two types of class imbal- Moreover, Glutamate receptor ionotropic, kainate 2 and ance in the data, namely between-class imbalance and Xylose isomerase have 20 and 7 interacting drugs in our within-class imbalance. In our experiments, our method dataset. Out of the top 20 predicted drugs for Glutamate has demonstrated significantly better prediction perfor- receptor ionotropic, kainate 2, 17 were successfully pre- mance than that of the state-of-the-art methods via cross- dicted as shown in Table 4. For Xylose isomerase,all its validation. In addition, we simulated new drug and new 7 drugs were predicted in the top 20. These promising target predictioncasestoevaluateourmethod’sperfor- results show that our method is indeed reliable for pre- mance under such challenging scenarios. Our experimen- dicting interactions in the cases of new drugs or targets. 
tal results show that our proposed method was able to Finally, we investigated the possibility that some of highly rank true known interactions, indicating that it the unconfirmed interactions in Tables 3 and 4 might is reliable in predicting interactions for new compounds be true. For example, we observed that Delta-type opi- or previously untargeted proteins. This is particularly oid receptor is indeed a target for Aripiprazole, which important in practice for both identifying new drugs and was confirmed from the T3DB online database [41]. We detecting new targets for drug repositioning. Table 4 Top 20 drugs predicted for Glutamate receptor ionotropic, kainate 2 and Xylose isomerase Glutamate receptor ionotropic, kainate 2 Xylose isomerase Rank Drug Rank Drug 1 Metharbital 1 D-Xylitol 2 Butabarbital 2 alpha-D-Xylopyranose 3 Pentobarbital 3 L-Xylopyranose 4 Thiopental 4 beta-D-Ribopyranose 5 Butethal 5 D-Sorbitol 6 Secobarbital 6 D-Xylulose 7 Talbutal 7 Vitamin C 8 Hexobarbital 8 2-Methylpentane-1,2,4-Triol 9 Barbital 9 Tris-Hydroxymethyl-Methyl-Ammonium 10 Amobarbital 10 (4r)-2-Methylpentane-2,4-Diol 11 Phenobarbital 11 Ethanol 12 Butalbital 12 Beta-D-Glucose 13 Aprobarbital 13 D-Allopyranose 14 Methylphenobarbital 14 2-Deoxy-Beta-D-Galactose 15 Primidone 15 Tris 16 Lysine Nz-Carboxylic Acid 16 3-O-Methylfructose in Linear Form 17 Domoic Acid 17 Dithioerythritol 18 Heptabarbital 18 (2s,3s)-1,4-Dimercaptobutane-2,3-Diol 19 Vitamin A 19 1,4-Dithiothreitol 20 Mephenytoin 20 Glycerol Drugs in bold are true known drugs of the targets The Author(s) BMC Bioinformatics 2016, 17(Suppl 19):509 Page 275 of 295 Additional files References 1. Paul SM, Mytelka DS, Dunwiddie CT, Persinger CC, Munos BH, Lindborg SR, Schacht AL. How to improve r&d productivity: the pharmaceutical Additional file 1: Drug IDs. This file contains the DrugBank IDs of the industry’s grand challenge. Nat Rev Drug Discov. 2010;9(3):203–14. drugs used in this study. (46 kb TXT) doi:10.1038/nrd3078. 2. 
Novac N. Challenges and opportunities of drug repositioning. Trends Additional file 2: Target IDs. This file contains the UniProt IDs of the Pharmacol Sci. 2013;34(5):267–72. doi:10.1016/j.tips.2013.03.004. targets used in this study. (23 kb TXT) 3. Ashburn TT, Thor KB. Drug repositioning: identifying and developing Additional file 3: Drug-target interaction matrix. This file contains the new uses for existing drugs. Nat Rev Drug Discov. 2004;3(8):673–83. known drug-target interactions in the form of a matrix, where rows doi:10.1038/nrd1468. represent the drugs, and the columns represent the targets. Drug-target 4. Li J, Zheng S, Chen B, Butte AJ, Swamidass SJ, Lu Z. A survey of current pairs that interact have a 1 in their corresponding cell and 0 otherwise. trends in computational drug repositioning. Brief Bioinformatics. (37500 kb TXT) 2016;17(1):2–12. doi:10.1093/bib/bbv020. Additional file 4: List of drug features. This file contains the names of the 5. Jin G, Wong STC. Toward better drug repositioning: prioritizing and drug features used in this study. More details on the features can be found integrating existing methods into efficient pipelines. Drug Discov Today. at: http://bioconductor.org/packages/release/bioc/html/Rcpi.html 2014;19(5):637–44. doi:10.1016/j.drudis.2013.11.005. (1 kb TXT) 6. Xie L, Kinnings SL, Xie L, Bourne PE. Drug repositioning: Bringing new life to shelved assets and existing drugs. John Wiley & Sons, Inc. 2012. Additional file 5: List of target features. This file contains the names of the doi:10.1002/9781118274408. target features used in this study. More details on the features can be 7. Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, Hufeisen SJ, Jensen found at: http://bidd2.nus.edu.sg/prof/manual/prof.htm (16 kb TXT) NH, Kuijer MB, Matos RC, Tran TB, et al. Predicting new molecular targets Additional file 6: Drug feature vectors. This file contains the feature for known drugs. Nature. 2009;462(7270):175–81. 
vectors for the drugs. (6180 kb TXT) 8. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. Kegg for Additional file 7: Target feature vectors. This file contains the feature integration and interpretation of large-scale molecular data sets. Nucleic vectors for the targets. (24400 kb TXT) Acids Res. 2012;40(D1):109–14. doi:10.1093/nar/gkr988. 9. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, Djoumbou Y, Eisner R, Guo AC, Wishart DS. Drugbank 3.0: a Acknowledgements comprehensive resource for ‘omics’ research on drugs. Nucleic Acids Res. Not applicable. 2011;39(suppl 1):1035–41. doi:10.1093/nar/gkq1126. 10. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, Overington JP. Chembl: a Declarations large-scale bioactivity database for drug discovery. Nucleic Acids Res. This article has been published as part of BMC Bioinformatics Volume 17 2011. doi:10.1093/nar/gkr777. Supplement 19, 2016. 15th International Conference On Bioinformatics 11. Kuhn M, Szklarczyk D, Pletscher-Frankild S, Blicher TH, von Mering C, (INCOB 2016): bioinformatics. The full contents of the supplement are Jensen LJ, Bork P. Stitch 4: integration of protein chemical interactions available online https://bmcbioinformatics.biomedcentral.com/articles/ with user data. Nucleic Acids Res. 2014;42(D1):401–7. supplements/volume-17-supplement-19. doi:10.1093/nar/gkt1207. 12. Jacob L, Vert JP. Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics. 2008;24(19):2149–56. Funding doi:10.1093/bioinformatics/btn409. Publication of this article was funded by the Agency for Science, Technology 13. Li H, Gao Z, Kang L, Zhang H, Yang K, Yu K, Luo X, Zhu W, Chen K, and Research (A*STAR), Singapore. Shen J, Wang X, Jiang H. Tarfisdock: a web server for identifying drug targets with docking approach. Nucleic Acids Res. 2006;34(suppl 2): Availability of data and materials 219–24. 
doi:10.1093/nar/gkl114. The dataset supporting the conclusions of this article is included within the 14. Xie L, Evangelidis T, Xie L, Bourne PE. Drug discovery using chemical article (and its additional files). systems biology: Weak inhibition of multiple kinases may contribute to the anti-cancer effect of nelfinavir. PLoS Comput Biol. 2011;7(4):1–13. Authors’ contributions doi:10.1371/journal.pcbi.1002037. AE performed the data collection, the implementation of the proposed 15. Mousavian Z, Masoudi-Nejad A. Drug-target interaction prediction via method and the writing of this document. MW and X-LL assisted with the chemogenomic space: learning-based methods. Expert Opinion Drug design of the proposed method and provided useful feedback and discussion Metab Toxicol. 2014;10(9):1273–87. doi:10.1517/17425255.2014.950222. throughout the course of this work. C-KK assisted in the writing of this 16. van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile document and helped with enhancing the results and discussion sections of kernels for predicting drug–target interaction. Bioinformatics. this work. All authors read and approved the final manuscript. 2011;27(21):3036–43. doi:10.1093/bioinformatics/btr500. 17. Bleakley K, Yamanishi Y. Supervised prediction of drug–target Competing interests interactions using bipartite local models. Bioinformatics. 2009;25(18): The authors declare that they have no competing interests. 2397–403. doi:10.1093/bioinformatics/btp433. 18. Zheng X, Ding H, Mamitsuka H, Zhu S. Collaborative matrix factorization Consent for publication with multiple similarities for predicting drug-target interactions. In: Not applicable. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2013. p. 1025–1033. Ethics approval and consent to participate doi:10.1145/2487575.2487670. Not applicable. 19. Gönen M. 
Predicting drug–target interactions from chemical and genomic kernels using bayesian matrix factorization. Bioinformatics. Author details 2012;28(18):2304–10. doi:10.1093/bioinformatics/bts360. School of Computer Science & Engineering, Nanyang Technological 20. Ezzat A, Zhao P, Wu M, Li X, Kwoh CK. Drug-target interaction prediction University, Nanyang Ave., Singapore 639798, Singapore. Institute for with graph regularized matrix factorization. IEEE/ACM Trans Comput Biol Infocomm Research (I2R), A*Star, Fusionopolis Way, Singapore 138632, Bioinformatics. 2016;PP(99):1–1. doi:10.1109/TCBB.2016.2530062. Singapore. 21. Mei JP, Kwoh CK, Yang P, Li XL, Zheng J. Drug-target interaction prediction by learning from local information and neighbors. Published: 22 December 2016 Bioinformatics. 2013;29(2):238–45. doi:10.1093/bioinformatics/bts670. The Author(s) BMC Bioinformatics 2016, 17(Suppl 19):509 Page 276 of 295 22. Chen X, Liu MX, Yan GY. Drug-target interaction prediction by random walk on the heterogeneous network. Mol BioSyst. 2012;8:1970–8. doi:10.1039/C2MB00002D. 23. Cheng F, Liu C, Jiang J, Lu W, Li W, Liu G, Zhou W, Huang J, Tang Y. Prediction of drug-target interactions and drug repositioning via network-based inference. PLoS Comput Biol. 2012;8(5):1002503. doi:10.1371/journal.pcbi.1002503. 24. He Z, Zhang J, Shi XH, Hu LL, Kong X, Cai YD, Chou KC. Predicting drug- target interaction networks based on functional groups and biological features. PloS One. 2010;5(3):9603. doi:10.1371/journal.pone.0009603. 25. DRAGON. http://www.talete.mi.it/. Accessed Nov 2016. 26. Li ZR, Lin HH, Han LY, Jiang L, Chen X, Chen YZ. Profeat: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 2006;34(suppl 2): 32–7. doi:10.1093/nar/gkl305. 27. Yu H, Chen J, Xu X, Li Y, Zhao H, Fang Y, Li X, Zhou W, Wang W, Wang Y. 
A systematic prediction of multiple drug-target interactions from chemical, genomic, and pharmacological data. PLoS ONE. 2012;7(5):1–14. doi:10.1371/journal.pone.0037608. 28. Nanni L, Lumini A, Brahnam S. A set of descriptors for identifying the protein-drug interaction in cellular networking. J Theor Biol. 2014;359: 120–8. doi:10.1016/j.jtbi.2014.06.008. 29. Xiao X, Min JL, Wang P, Chou KC. igpcr-drug: A web server for predicting interaction between gpcrs and drugs in cellular networking. PLoS ONE. 2013;8(8):1–10. doi:10.1371/journal.pone.0072234. 30. Cao DS, Liu S, Xu QS, Lu HM, Huang JH, Hu QN, Liang YZ. Large-scale prediction of drug-target interactions using protein sequences and drug topological structures. Analytica Chimica Acta. 2012;752:1–10. doi:10.1016/j.aca.2012.09.021. 31. Yamanishi Y, Pauwels E, Saigo H, Stoven V. Extracting sets of chemical substructures and protein domains governing drug-target interactions. J Chem Inform Modeling. 2011;51(5):1183–94. doi:10.1021/ci100476q. 32. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84. doi:10.1109/TKDE.2008.239. 33. Weiss GM. Mining with rarity: A unifying framework. SIGKDD Explor Newsl. 2004;6(1):7–19. doi:10.1145/1007730.1007734. 34. Cao DS, Xiao N, Xu QS, Chen AF. Rcpi: R/bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics. 2015;31(2):279–81. doi:10.1093/bioinformatics/btu624. 35. Wassermann AM, Geppert H, Bajorath J. Ligand prediction for orphan targets using support vector machines and various target-ligand kernels is dominated by nearest neighbor effects. J Chem Inform Model. 2009;49(10):2155–67. doi:10.1021/ci9002624. PMID: 19780576. 36. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. doi:10.1023/A:1010933404324. 37. Zhou ZH. Ensemble methods: Foundations and algorithms. Boca Raton: CRC Press; 2012. 38. Arthur D, Vassilvitskii S. K-means++: The advantages of careful seeding. 
In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’07. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics; 2007. p. 1027–1035. 39. Meir R, Rätsch G. An Introduction to Boosting and Leveraging In: Mendelson S, Smola AJ, editors. Advanced Lectures on Machine Learning: Machine Learning Summer School 2002 Canberra, Australia, February 11–22, 2002 Revised Lectures. Berlin, Heidelberg: Springer; 2003. Submit your next manuscript to BioMed Central p. 118–83. doi:10.1007/3-540-36434-X_4. 40. Fawcett T. An introduction to roc analysis. Pattern Recognit Lett. and we will help you at every step: 2006;27(8):861–74. doi:10.1016/j.patrec.2005.10.010. • We accept pre-submission inquiries 41. Lim E, Pon A, Djoumbou Y, Knox C, Shrivastava S, Guo AC, Neveu V, Wishart DS. T3db: a comprehensively annotated database of common � Our selector tool helps you to find the most relevant journal toxins and their targets. Nucleic Acids Res. 2010;38(suppl 1):781–6. � We provide round the clock customer support doi:10.1093/nar/gkp934. � Convenient online submission � Thorough peer review � Inclusion in PubMed and all major indexing services � Maximum visibility for your research Submit your manuscript at www.biomedcentral.com/submit
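As an illustration of the Nearest Neighbor baseline and the AUC evaluation described under Experimental settings and Cross validation experiments, the following is a minimal Python sketch, not the authors' MATLAB implementation. It assumes that the norm in the similarity formula is the Euclidean norm and that feature vectors are plain lists of floats. The names `sim`, `nn_score`, `sample_negatives` and `auc` are hypothetical helpers introduced here for illustration only.

```python
import math
import random


def sim(a, b):
    """Similarity exp(-||a - b|| / |F|); the Euclidean norm is an assumption."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return math.exp(-dist / len(a))  # len(a) plays the role of |F|


def nn_score(a, positives):
    """Score of test instance a: similarity to its nearest neighbor in P."""
    return max(sim(a, b) for b in positives)


def sample_negatives(negatives, n, seed=0):
    """Draw n negatives without replacement so that the sample size matches |P|."""
    return random.Random(seed).sample(negatives, n)


def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Toy data (made up for illustration): two known interacting pairs and
    # a larger pool of non-interacting pairs, each with |F| = 2 features.
    P = [[0.0, 0.0], [1.0, 1.0]]
    N = [[5.0, 5.0], [6.0, 4.0], [4.0, 6.0], [7.0, 7.0]]
    N_sampled = sample_negatives(N, len(P))  # balanced negative set

    test_pos = [[0.5, 0.5]]
    test_neg = [[6.0, 6.0]]
    pos_scores = [nn_score(a, P) for a in test_pos]
    neg_scores = [nn_score(a, P) for a in test_neg]
    print(len(N_sampled), auc(pos_scores, neg_scores))
```

The pairwise AUC above is quadratic in the number of test instances, which is acceptable for a small illustration; a rank-based computation would be preferable on a dataset of the size used in the study.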
BMC Bioinformatics – Springer Journals
Published: Dec 22, 2016