Development and evaluation of a deep learning model for protein–ligand binding affinity prediction

Abstract

Motivation: Structure-based ligand discovery is one of the most successful approaches for augmenting the drug discovery process. Currently, there is a notable shift towards machine learning (ML) methodologies to aid such procedures. Deep learning has recently gained considerable attention as it allows the model to 'learn' to extract features that are relevant for the task at hand.

Results: We have developed a novel deep neural network estimating the binding affinity of ligand–receptor complexes. The complex is represented with a 3D grid, and the model utilizes a 3D convolution to produce a feature map of this representation, treating the atoms of both proteins and ligands in the same manner. Our network was tested on the CASF-2013 'scoring power' benchmark and the Astex Diverse Set and outperformed classical scoring functions.

Availability and implementation: The model, together with usage instructions and examples, is available as a git repository at http://gitlab.com/cheminfIBB/pafnucy.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Structure-based virtual screening techniques are some of the most successful methods for augmenting the drug discovery process (Bajusz et al., 2017; Fradera and Babaoglu, 2017). With structure-based screening, one tries to predict the binding affinity (or, more often, a score related to it) between a target and a candidate molecule based on a 3D structure of their complex. This allows molecules to be ranked and prioritized for further processing and subsequent testing. Numerous scoring schemes have been developed to aid this process; most of them use statistical and/or expert analysis of available protein–ligand structures (Morris et al., 2009; Muegge, 2006; Verdonk et al., 2003). Currently, there is a notable shift towards scoring functions using machine learning (ML) methodologies, and this has been highlighted by several reviews (Cheng et al., 2012; Lima et al., 2016; Ma et al., 2013). These methods are naturally capable of capturing non-linear and complex relationships in the available data.

Rather than 'manually' creating rules using expert knowledge and statistical inference, ML models use arbitrary functions with adjustable parameters that are capable of transforming the input (in this scenario, a protein–ligand complex) to the output (a score related to protein–ligand binding affinity). Briefly, when the model is presented with examples of input data paired with the desired outcome, it 'learns' to return predictions that are in agreement with the values provided. Typically, the process of learning is incremental: by introducing small changes to the model parameters, the prediction is moved closer to the target value. Prime examples of ML scoring functions are RF-Score (Ballester and Mitchell, 2010), which uses random forest, and NNscore (Durrant and McCammon, 2010, 2011), which uses an ensemble of shallow neural networks.
These scoring functions were proven useful in virtual screening campaigns and yielded more active compounds than their classical counterparts (Kinnings et al., 2011; Wójcikowski et al., 2017).

However, one drawback of such ML approaches is that they still rely on feature engineering, i.e. they utilize expert knowledge to define rules that become the basis of input data preprocessing. Hence, one can argue that they are just more sophisticated classical scoring functions with more complex rules.

The ML rule of thumb says that in order to establish a good predictive model, the model needs a lot of data to be able to distinguish general trends and patterns from noise. The growing amount of both structural data and affinity measurements has allowed researchers to explore deep learning. Briefly, a deep neural network consists of multiple layers of non-linear transformations that extract and combine information from data to develop sophisticated relationships between the input and the output. One of the main advantages of deep learning is that it allows for the reduction of feature engineering: the model learns to extract features as a natural consequence of fitting its parameters to the available data. It is clear that choosing the representation of the input data has a profound impact on the predictive power of a model. Currently, there is a lot of effort in the field to incorporate feature extraction directly into the ML model. In such an approach, a learnable molecule representation replaces classical descriptors and fingerprints and becomes the first part of the model. This representation is then trained together with the predictive part of the model to extract features that are useful in solving a specific task. With such a design, it is therefore theoretically possible to find and quantify relationships and/or mechanisms that have not yet been discovered or are unknown to the experts (Nketia et al., 2017; Zhang et al., 2017).

Deep learning has been relatively widely used by the bioinformatics (Alipanahi et al., 2015; Jiménez et al., 2017; Jurtz et al., 2017; Leung et al., 2014; Park and Kellis, 2015) and computational biology community (Angermueller et al., 2016). Several promising examples of deep learning methods have also been shown for computer-aided drug design (CADD). In what follows, we first focus on ligand-based methods, which are in general more established, and then we continue with structure-based models.

The simplest deep models in ligand-based design use molecular fingerprints as feature vectors and fully-connected (dense) neural networks built on top of them. Such approaches were proven successful; they outperform other ML methods in predicting bioactivity (Lenselink et al., 2017) and other properties of small molecules, like aqueous solubility (Lusci et al., 2013) or toxicity (Xu et al., 2015).

Additionally, neural network models make it easy to create multi-task classifiers or regressors, predicting, for example, activities against multiple targets at once. It has been shown that such QSAR models perform better than single-task networks (Dahl et al., 2014; Lenselink et al., 2017; Ma et al., 2015; Ramsundar et al., 2017; Xu et al., 2017), as they benefit from more training data, but also because they are able to 'share' internal representations between tasks, and therefore learn to recognize more general patterns in the data.

As mentioned previously, neural networks allow for more flexibility in terms of how the data are provided to the model, so that it might learn the best representation for its purpose. One approach to achieve this is to use convolutions on the molecular graph, allowing relevant patterns in this graph to be identified (Duvenaud et al., 2015; Kearnes et al., 2016). Another way is to use a recurrent neural network on directed acyclic representations of the molecular graph (Lusci et al., 2013), or even to apply natural-language processing techniques to molecules encoded as SMILES strings (Jastrzębski et al., 2016).

Application of deep models to de novo ligand design has also been explored. There are several examples of models which use autoencoders and/or recurrent neural networks and are able to propose new molecules with desired properties (Ertl et al., 2017; Gómez-Bombarelli et al., 2017; Olivecrona et al., 2017; Segler et al., 2017). Using autoencoders also makes it possible to represent molecules with short, real-valued vectors (extracted from the bottleneck layer), which facilitates exploration of the chemical space (Gómez-Bombarelli et al., 2017).

Although deep learning is more readily used in ligand-based regimes, there are currently a couple of interesting examples of structure-based neural networks. In AtomNet (Wallach et al., 2015), the input (a molecular complex) is discretized to a 3D grid and fed directly into a convolutional neural network. Instead of data preprocessing, the model uses a learnable representation to recognize different groups of interacting atoms. AtomNet is a classification method that yields 1 if the ligand is active and 0 otherwise. Another similar model was created by Ragoza et al. (2016) and trained to perform two independent classification tasks: activity and pose prediction. However, with classification methods, we lose information about the strength of the interaction between the protein and the ligand.

Since neural networks are also suitable for regression, Gomes et al. (2017) created a model predicting the energy gap between a bound protein–ligand complex and the unbound state. In their work, radial pooling filters with learnable mean and variance were used to process the input. Such filters enabled the production of a summary of the atom's environment and a representation that was invariant to atom ordering and the orientation of the complex.

Taking into account the current findings and the aforementioned approaches, we have developed Pafnucy (pronounced 'paphnusy'), a novel deep neural network tailored for many structure-based approaches, including derivative prioritization and virtual screening. Similar to Ragoza et al. (2016), the input structure is represented with a 3D grid, and a combination of convolutional and dense layers is used; however, our model tries to predict the exact binding affinity value. Pafnucy utilizes a more natural approach to atom description, in which both proteins and ligands have the same atom types. This approach serves as a regularization technique, as it forces the network to discover general properties of interactions between proteins and ligands. Additionally, the design of Pafnucy provides insight into the feature importance and the information extraction that is done during learning and the final prediction of binding affinity.

The network was implemented with TensorFlow (Abadi et al., 2015) using the Python API and trained on the PDBbind database (Liu et al., 2017). The source code, trained model and usage instructions are available as a git repository at http://gitlab.com/cheminfIBB/pafnucy.

2 Materials and methods

2.1 Data

2.1.1 Representation of a molecular complex

Three-dimensional structures of protein–ligand complexes require specific transformations and encoding in order to be utilized by a neural network. In our approach, we cropped the complex to a 20-Å cubic box focused at the geometric center of the ligand. We then discretized the positions of heavy atoms using a 3D grid with 1-Å resolution (see Supplementary Fig. S1). This approach allowed for the representation of the input as a 4D tensor in which each point is defined by Cartesian coordinates (the first 3 dimensions of the tensor) and a vector of features (the last dimension).

In Pafnucy, 19 features were used to describe an atom:

- 9 bits (one-hot or all null) encoding atom types: B, C, N, O, P, S, Se, halogen and metal
- 1 integer (1, 2 or 3) with atom hybridization: hyb
- 1 integer counting the number of bonds with other heavy atoms: heavy_valence
- 1 integer counting the number of bonds with other heteroatoms: hetero_valence
- 5 bits (1 if present) encoding properties defined with SMARTS patterns: hydrophobic, aromatic, acceptor, donor and ring
- 1 float with the partial charge: partialcharge
- 1 integer (1 for ligand, -1 for protein) to distinguish between the two molecules: moltype

The SMARTS patterns were defined the same way as in our previous project (Stepniewska-Dziubinska et al., 2017). The partial charges were scaled by the training set's standard deviation in order to get a distribution with a unit standard deviation, which improves learning. In case of collisions (multiple atoms in a single grid point), which rarely occur for a 1-Å grid, the features from all colliding atoms were added.
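As a concrete illustration of the encoding above, the sketch below shows how such a feature vector and grid could be assembled with Open Babel's pybel bindings and NumPy. The helper names (ATOM_CODES, encode, make_grid) are ours, the metal list is abbreviated and the attribute names follow OpenBabel 2.x; the repository ships the authors' tested implementation of this step.

    import numpy as np
    import pybel  # OpenBabel 2.x bindings; in 3.x: from openbabel import pybel

    # Illustrative atom-type codes (indices 0-8 of the feature vector).
    ATOM_CODES = {5: 0, 6: 1, 7: 2, 8: 3, 15: 4, 16: 5, 34: 6}  # B, C, N, O, P, S, Se
    HALOGENS = {9, 17, 35, 53, 85}
    METALS = {3, 11, 12, 19, 20, 25, 26, 27, 28, 29, 30}        # abbreviated list

    def encode(atom, molcode):
        """Build the 19-element feature vector for one pybel atom.
        molcode is 1 for ligand atoms and -1 for protein atoms."""
        v = np.zeros(19)
        if atom.atomicnum in ATOM_CODES:
            v[ATOM_CODES[atom.atomicnum]] = 1
        elif atom.atomicnum in HALOGENS:
            v[7] = 1
        elif atom.atomicnum in METALS:
            v[8] = 1
        v[9] = atom.hyb               # hybridization
        v[10] = atom.heavyvalence     # bonds with other heavy atoms
        v[11] = atom.heterovalence    # bonds with heteroatoms
        # v[12:17]: hydrophobic, aromatic, acceptor, donor, ring SMARTS flags
        # (patterns omitted here; matched with pybel.Smarts in the real pipeline)
        v[17] = atom.partialcharge    # scaled by the training set's std in practice
        v[18] = molcode
        return v

    def make_grid(coords, features, center, box=20.0, resolution=1.0):
        """Discretize atoms into a (21, 21, 21, 19) grid around `center`;
        features of atoms falling into the same grid point are summed."""
        size = int(box / resolution) + 1
        grid = np.zeros((size, size, size, features.shape[1]))
        idx = np.rint((coords - center) / resolution).astype(int) + size // 2
        for (x, y, z), f in zip(idx, features):
            if 0 <= min(x, y, z) and max(x, y, z) < size:
                grid[x, y, z] += f
        return grid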
2.1.2 Dataset preparation

The network was trained and tested with protein–ligand complexes from the PDBbind database v. 2016 (Liu et al., 2017). This database consists of 3D structures of molecular complexes and their corresponding binding affinities expressed as pKa (-log Kd or -log Ki) values. PDBbind complexes were divided by Liu et al. into 3 overlapping subsets. The general set includes all available data. From this set, the refined set, which comprises complexes of higher quality, is selected. Finally, the complexes from the refined set are clustered by protein similarity, and 5 representative complexes are selected from each cluster. This fraction of the database is called the core set and is designed as a high-quality benchmark for structure-based CADD methods.

To properly employ PDBbind information and prevent data leakage, we split the data into disjoint subsets, i.e. the refined set was subtracted from the general set, and the core set was subtracted from the refined set, so that there are no overlaps between the three subsets. Next, we discarded all protein–protein, protein–nucleic acid and nucleic acid–ligand complexes from these new datasets. Finally, in order to evaluate our model with the CASF-2013 'scoring power' benchmark (Li et al., 2014), we needed to exclude all data that overlap with the 195 complexes used in CASF-2013. We therefore excluded a total of 87 overlapping complexes (5 were part of the general set, and 82 were part of the refined set) from the training and validation sets. For the list of excluded structures, see Supplementary Table S1.

All complexes used in this study were protonated and charged using UCSF Chimera (Pettersen et al., 2004), with Amber ff14SB for standard residues and AM1-BCC for non-standard residues and ligands. No additional improvements nor calibration was performed on the complexes; this default protocol was chosen to be in line with (Li et al., 2014), so that Pafnucy could be compared to other methods tested on the CASF-2013 'scoring power' benchmark.

The remaining complexes of the PDBbind v. 2016 dataset were divided as follows: (i) 1000 randomly selected complexes from the refined set were used for validation, (ii) the whole core set (290 complexes) was used as an external test set, and (iii) all other complexes (the remainder of the refined set and the general set, 11906 in total) were used as the training set. In summary, the general and refined sets were used to train the model and select the hyperparameters, while the core set was used as an external test set that was unknown to the model during training and validation. A scheme illustrating the relationships between the subsets and the dataset partitioning is available in Supplementary Figure S2.

Atomic features were calculated using Open Babel (O'Boyle et al., 2011), and the complexes were transformed into grids. Helper functions used to prepare the data and a Jupyter Notebook with all preprocessing steps are available at http://gitlab.com/cheminfIBB/pafnucy.

As an additional external test set, we used 73 complexes from the Astex Diverse Set (Hartshorn et al., 2007). This dataset, although substantially smaller than the PDBbind subsets, provides Pafnucy with structures from an independent source and can help in detecting generalization problems related to database-specific artefacts. Of the 85 complexes in the Astex Diverse Set, we excluded those without binding affinity (11 complexes) and those present in the PDBbind database (a single complex, PDB ID: 1YVF, was present in the general set). The remaining structures were prepared the same way as the PDBbind database.
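The disjoint split described above amounts to a few set operations. Below is a minimal sketch under the assumption that general, refined, core and casf_2013 are Python sets of PDB codes parsed from the PDBbind index files and the CASF-2013 list; the variable names and the seed are ours, and the repository's own notebooks implement the authors' actual procedure.

    import random

    refined_only = refined - core                       # disjoint refined set
    general_only = general - refined                    # disjoint general set
    usable = (refined_only | general_only) - casf_2013  # drop CASF-2013 overlaps

    random.seed(0)  # any fixed seed; the paper does not specify one
    validation = set(random.sample(sorted(usable & refined_only), 1000))
    training = usable - validation                      # 11906 complexes in the paper
    test = core                                         # 290 complexes, external test set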
2.2 Network

2.2.1 Architecture

The architecture used in Pafnucy is a deep convolutional neural network with a single output neuron for predicting the binding affinity. The model consists of two parts, the convolutional and dense parts, with different types of connections between layers (see Fig. 1). Convolution, from which the name 'convolutional' stems, is a mathematical operation that mixes two functions together. Most neural network libraries actually substitute the convolution operation with cross-correlation (Goodfellow et al., 2016), which has a more intuitive interpretation and measures the similarity of two functions. The model discovers patterns that are encoded by the filters in the convolutional layer and creates a feature map with the spatial occurrences of each pattern in the data.

Pafnucy's input, a molecular complex, is represented with a 4D tensor and treated like a 3D image with multiple color channels. Each position of the input (x, y and z coordinates) is described by a vector of 19 properties (see Section 2.1.1), which is analogous to how each pixel of an image (x and y coordinates) is described by a vector of intensities of three basic colors.

First, the input is processed by a block of 3D convolutional layers, each combined with a max pooling layer. Pafnucy uses 3 convolutional layers with 64, 128 and 256 filters. Each layer has 5-Å cubic filters and is followed by a max pooling layer with a 2-Å cubic patch. The result of the last convolutional layer is flattened and used as input for a block of dense (fully-connected) layers. We used 3 dense layers with 1000, 500 and 200 neurons. In order to improve generalization, dropout with a drop probability of 0.5 was used for all dense layers. We also experimented with 0.2 dropout and no dropout and achieved worse results on the validation set.

Both convolutional and dense layers are composed of rectified linear units (ReLU). ReLU was chosen because it speeds up the learning process compared with other types of activations. We also experimented with Tanh units and achieved a very similar prediction accuracy, but learning was much slower.

[Fig. 1. Pafnucy's architecture. The molecular complex is represented with a 4D tensor, processed by three convolutional layers and three dense (fully-connected) layers to predict the binding affinity.]
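The published network is implemented with TensorFlow's low-level Python API; purely for orientation, the layer layout described above can be sketched in today's tf.keras as follows. The sketch mirrors the stated filter counts, kernel and pooling sizes, dense widths, dropout and single-neuron output, but it is not the authors' code, and padding and initialization details are only approximate (cf. Section 2.2.2).

    import tensorflow as tf

    def build_model(size=21, channels=19, l2=1e-3):
        """Sketch of a Pafnucy-like network in tf.keras (not the authors' code)."""
        reg = tf.keras.regularizers.l2(l2)
        init = tf.keras.initializers.TruncatedNormal(stddev=0.001)  # cf. Section 2.2.2
        layers = [tf.keras.layers.InputLayer(input_shape=(size, size, size, channels))]
        for filters in (64, 128, 256):                    # three 5-Å cubic convolutions
            layers.append(tf.keras.layers.Conv3D(filters, 5, padding='same',
                                                 activation='relu',
                                                 kernel_initializer=init,
                                                 kernel_regularizer=reg))
            layers.append(tf.keras.layers.MaxPooling3D(2))  # 2-Å cubic max pooling
        layers.append(tf.keras.layers.Flatten())
        for width in (1000, 500, 200):                    # dense part with 0.5 dropout
            layers.append(tf.keras.layers.Dense(width, activation='relu',
                                                kernel_regularizer=reg))
            layers.append(tf.keras.layers.Dropout(0.5))
        layers.append(tf.keras.layers.Dense(1))           # single output: binding affinity
        model = tf.keras.Sequential(layers)
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                      loss='mse')
        return model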
2.2.2 Training

The initial values of the convolutional filter weights were drawn from a truncated normal distribution with 0 mean and 0.001 standard deviation, and the corresponding biases were set to 0.1. The weights in the dense layers were initialized with a truncated normal distribution with 0 mean and a standard deviation of 1/√n, where n is the number of incoming neurons for a given layer. The corresponding biases were set to 1.0.

The Adam optimizer was used to train the network, with a 10^-5 learning rate and 5 examples per mini-batch (the training set contains 11906 complexes; therefore, the last batch actually consisted of 6 complexes instead of 5). Larger batch sizes (10 and 20 examples) were also tested but resulted in worse performance. Training was carried out for 20 epochs, and the model with the lowest error on the validation set was selected (in the case of the network described in this work, it was the model after 14 epochs of training).

To reduce overfitting, we used the dropout approach mentioned earlier and L2 weight decay with λ = 0.001. Using a higher value (λ = 0.01) decreased the model's capacity too much and resulted in higher training and validation errors. In addition to providing regularization, L2 allows us to investigate feature importance: if a weight differs considerably from 0, the information it transfers must be important for the model to make a prediction (see Section 4).

An important part of our approach was to develop a model that was not sensitive to the ligand–receptor complex orientation. Therefore, every structure was presented to the network in 24 different orientations (i.e. all possible combinations of 90° rotations of a cubic box), yielding 24 different training examples per protein–ligand complex. By using systematic rotations of the complexes during training, we anticipated that the network would learn more general rules about protein–ligand interactions and achieve better performance on new data. Indeed, in our experiments, we observed a much worse performance of models trained on single orientations, regardless of the hyperparameters used to define a particular network.
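The 24 orientations are exactly the proper rotations of a cube, which can be enumerated as signed permutation matrices with determinant +1. A minimal NumPy sketch (the function name is ours):

    import itertools
    import numpy as np

    def cube_rotations():
        """All 24 proper rotations of a cube: signed axis permutations with det +1."""
        mats = []
        for perm in itertools.permutations(range(3)):
            for signs in itertools.product((1, -1), repeat=3):
                m = np.zeros((3, 3))
                for row, axis, sign in zip(range(3), perm, signs):
                    m[row, axis] = sign
                if np.isclose(np.linalg.det(m), 1.0):  # keep rotations, drop reflections
                    mats.append(m)
        return mats

    assert len(cube_rotations()) == 24
    # rotated = [coords @ m.T for m in cube_rotations()]  # then re-grid each copy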
3 Results

The error on the training and validation sets was monitored during learning (see Supplementary Fig. S3). Although the model was trained on 24 different rotations of each complex, the RMSE (root mean square error) was calculated for the original orientation only, in order to speed up the computations. After 14 epochs of training, the model started to overfit, and the error on the validation set started to slowly yet steadily increase. The best set of weights of the network, obtained after 14 epochs of training, was saved and used as the final model.

Model performance was evaluated on all subsets of the data (see Table 1 and Fig. 2). For each complex in the dataset, the affinity was predicted and compared to the real value. The prediction error was measured with RMSE and MAE (mean absolute error). The correlation between the scores and the experimentally measured binding constants was assessed with Pearson's correlation coefficient (R) and the standard deviation in regression (SD). SD is a measure used in CASF (Li et al., 2014) and is defined as follows:

    SD = \sqrt{ \frac{1}{N-1} \sum_{i=1}^{N} \left[ t_i - (a y_i + b) \right]^2 }

where t_i and y_i are the measured and predicted affinities for the i-th complex, whereas a and b are the slope and the intercept of the regression line between measured and predicted values, respectively.
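All four metrics are straightforward to compute; below is a minimal NumPy sketch (t holds the measured affinities, y the predictions; the function name is ours):

    import numpy as np

    def scoring_metrics(t, y):
        """RMSE, MAE, CASF's SD and Pearson's R for measured (t) vs predicted (y)."""
        t, y = np.asarray(t, float), np.asarray(y, float)
        rmse = np.sqrt(np.mean((y - t) ** 2))
        mae = np.mean(np.abs(y - t))
        r = np.corrcoef(t, y)[0, 1]                  # Pearson's correlation
        a, b = np.polyfit(y, t, 1)                   # regression line t ~ a*y + b
        sd = np.sqrt(np.sum((t - (a * y + b)) ** 2) / (len(t) - 1))
        return {'RMSE': rmse, 'MAE': mae, 'SD': sd, 'R': r}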
Table 1. Pafnucy's performance

Dataset                    RMSE   MAE    SD     R
Test (v. 2016 core set)    1.42   1.13   1.37   0.78
Validation                 1.44   1.14   1.43   0.72
Training                   1.21   0.95   1.19   0.77

Note: Prediction accuracy for each subset was evaluated using four different metrics (see main text).

[Fig. 2. Predictions for two test sets (core sets from PDBbind v. 2016 and v. 2013), training set and validation set.]

As expected, the network achieves the lowest error on the training set (Fig. 2c), which was used to find the weights of the network. More importantly, Pafnucy also returns accurate predictions for the two test sets (Fig. 2a and b), which were unknown to the model during training and validation. The results on the CASF-2013 'scoring power' benchmark (PDBbind v. 2013 core set), although substantially worse than for the other subsets, are still better than those for any other scoring function tested by Li et al. (2014): the best-performing X-Score had R = 0.61 and SD = 1.78, while our model achieved R = 0.70 and SD = 1.61 (see Table 2). To our knowledge, the only scoring function with better performance published so far is RF-Score v3, which achieves R = 0.74 and SD = 1.51 on CASF-2013 [results were calculated with ODDT (Wójcikowski et al., 2015)].

Table 2. Results on the CASF-2013 'scoring power' benchmark (PDBbind v. 2013 core set)

      Pafnucy   X-Score   ChemScore(a)   ChemPLP(b)   PLP1(c)   G-Score(a)
SD    1.61      1.78      1.82           1.84         1.86      1.87
R     0.70      0.61      0.59           0.58         0.57      0.56

Note: Only the five best-performing scoring functions are presented; for full results see (Li et al., 2014). (a) SYBYL. (b) GOLD. (c) Discovery Studio.

We also compared Pafnucy to X-Score on the Astex Diverse Set (Table 3). This experiment provides Pafnucy with a test set completely separate from the data provided by Liu et al. Both methods have errors comparable to those obtained on the PDBbind data. As expected, Pafnucy outperforms X-Score on the Astex Diverse Set, regardless of which measure is used. The observed correlation, however, is lower for both methods. This effect is partially due to the fact that the Astex dataset contains only 73 complexes, and therefore the correlation is much more sensitive to small changes in the predictions than for bigger subsets.

Table 3. Prediction accuracy on the Astex Diverse Set

Method    RMSE   MAE    SD     R
Pafnucy   1.43   1.13   1.43   0.57
X-Score   1.55   1.22   1.48   0.52
4 Discussion

4.1 Stability of the results with respect to input rotation

One of the biggest challenges of this project was to properly handle the orientation of a molecular complex. This problem occurs also in image recognition: the input looks different when an object is shown from a different angle, yet it contains the same information about the underlying real object. There are two main approaches to dealing with this issue: using a representation invariant to the complex position (e.g. molecule-level features or internal coordinates), or creating a model that is robust to the input orientation. The model presented in this work uses the latter approach, similarly to how 2D convolutional neural networks are used in image recognition. Therefore, to generalize well, the model needed to learn to extract information from differently presented input. In order to achieve this, we augmented the dataset with systematic rotations of the input data. If Pafnucy was trained correctly, it should return similar predictions regardless of the orientation of the complex.

To test the model's stability, we selected the PDE10A protein, a cAMP/cGMP phosphodiesterase important in signal transduction and recently linked to neuropsychiatric disorders (MacMullen et al., 2017). PDE10A is complexed with 57 different ligands in the PDBbind database (41 complexes in the training set, 6 in the validation set and 10 in the test set). Each of the complexes was presented to the model in 24 different rotations, and the distribution of the returned predictions was analyzed. As anticipated, the variability of the predicted binding constants is low (see Supplementary Fig. S4). Additionally, the variability depends neither on the value of the prediction nor on the subset the molecule belongs to.
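This stability check can be expressed in a few lines, reusing the hypothetical make_grid and cube_rotations helpers from the earlier sketches (model stands for the trained network; coords and features describe one complex, with coordinates already centered on the ligand; all names are ours):

    import numpy as np

    grids = np.stack([make_grid(coords @ m.T, features, center=np.zeros(3))
                      for m in cube_rotations()])
    preds = model.predict(grids).ravel()     # 24 predictions, one per orientation
    print(preds.mean(), preds.std())         # a small spread = rotation-stable output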
4.2 How Pafnucy sees and processes the data

Neural networks are often deemed harder to analyze and interpret than simpler models and are sometimes regarded as 'black-boxes'. The worry is that a model can yield good predictions for the wrong reasons (e.g. artefacts hidden in the data) and therefore will not generalize well for new datasets. In order to trust a neural network and its predictions, one needs to ensure that the model uses information that is relevant to the task at hand. In this section, we analyze which parts of the input are the most important and have the biggest impact on the predictions.

In the case of random forests, for example, there is an established way to calculate feature importance based on the impurity decrease (Breiman et al., 1984). With neural networks, there is no such consensus, as the interpretation of the model's parameters may differ considerably between networks with different architectures.

In the case of Pafnucy, which was trained with L2 regularization, we can estimate feature importance by looking at the distributions of weights associated with the convolutional filters in the first hidden layer. Their initial values were close to 0 (see Section 2.2.2 for more details). During training, the weights tend to spread and form wider ranges, as weights with higher absolute values pass more information to the deeper layers of the network. Because Pafnucy was trained with L2 regularization, only crucial weights were likely to reach such high absolute values.

The input was represented using 19 channels, some of which were expected to be of low relevance for the model (e.g. the boron atom type). As we can see in Figure 3, the feature with the widest range is moltype, the feature distinguishing the protein from the ligand. This result implies that Pafnucy learned that binding affinity depends on the relationship between the two molecules and that recognizing them is crucial. Additionally, the weights for the selenium and boron atom types (Se and B, respectively) barely changed during training and are close to 0. This result can be interpreted in two ways: either the network found other features of protein–ligand complexes more important for binding affinity, or, due to the infrequent occurrence of these atom types in ligands, the network was not able to find any general patterns for their influence on binding affinity.

[Fig. 3. Range of weights for each input channel (feature). Outliers are not shown.]

To further inspect how the network utilizes the input, we analyzed the impact of missing data on the prediction. For this purpose, we selected one of the PDE10A complexes with a benzimidazole inhibitor (complex PDB ID: 3WS8; ligand PDB ID: X4C). The experiment was carried out as follows: we produced 343 corrupted complexes with some missing data and predicted the binding affinity for each. The missing data were produced by deleting a 5-Å cubic box from the original data. We slid the box with a 3-Å step (in every direction), thus yielding 7^3 = 343 corrupted inputs. Next, we rotated the complex by 180° about the X-axis and followed the same procedure, thus yielding another 343 corrupted inputs. Then, for each of the two orientations, we took the 10 corrupted inputs with the highest drop in predicted affinity (Fig. 4); we wanted to find which atoms' absence caused the highest drops in the predictions.

As we can see in Figure 4a and b, for both orientations we identified the same region, containing the ligand and its nearest neighbourhood. The boxes contain the amino acids participating in the interactions with the ligand, i.e. Gln726, which forms a hydrogen bond, and Phe729, which forms a π–π interaction with the ligand (Fig. 4c).

Additionally, if we consider the 15 corrupted complexes with the highest drop in predictions, we find other amino acids interacting with the ligand: Tyr693, which forms a hydrogen bond, and Met713, which forms hydrophobic contacts with the ligand. The methodology presented above can be applied to other complexes in order to elucidate the specific ligand–receptor interactions with the most profound effect on the prediction.

[Fig. 4. The most important parts of the input. Regardless of the complex orientation, the same region of the input had the highest impact on the prediction. Note that the second plot is rotated back about the X-axis to ease the comparison. (a) Original orientation. (b) Rotated by 180° about the X-axis. (c) Protein–ligand interactions. Graphic was generated with Poseview (Stierand and Rarey, 2010).]
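A sketch of this occlusion experiment, assuming grid is the 21x21x21x19 input tensor and model the trained network (the names are ours, and the authors' exact sliding scheme may differ in edge handling):

    import itertools
    import numpy as np

    def occlusion_drops(grid, model, box=5, step=3):
        """Delete a box-sized cube of input at each position of a coarse walk and
        record the drop in predicted affinity (cf. the 7^3 = 343 corrupted inputs)."""
        baseline = float(model.predict(grid[None])[0, 0])
        size = grid.shape[0]
        drops = {}
        for x, y, z in itertools.product(range(0, size, step), repeat=3):
            corrupted = grid.copy()
            corrupted[x:x+box, y:y+box, z:z+box, :] = 0.0  # edge boxes are clipped
            drops[(x, y, z)] = baseline - float(model.predict(corrupted[None])[0, 0])
        return drops

    # top10 = sorted(drops.items(), key=lambda kv: kv[1], reverse=True)[:10]

The positions with the largest drops mark the regions the model relies on most; in the analysis above, these coincide with the ligand and its interacting residues.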
Going back to the uncorrupted input, we wanted to investigate how Pafnucy managed to give almost identical predictions for two different orientations of the complex (the second rotated about the X-axis by 180°). For this inquiry, we analyzed the activations of the hidden layers for the two inputs. In Figure 5, we can see that the first hidden layer has very different activation patterns for the two orientations of the input: Pafnucy gets very different data and needs to use different filters in the first convolutional layer to process them. However, the closer we get to the output layer, the more similar the activations become. We can clearly see that our model learned to extract the same information from differently presented data.

[Fig. 5. Activations on the hidden layers for two orientations of the PDE10A complex (PDB ID: 3WS8). Darker colors indicate higher values. Cosine distances (d) between the activation patterns for each layer are provided.]

5 Conclusions

In this work, we presented a deep neural network, Pafnucy, which can be used in structure-based ligand discovery campaigns, as a scoring function in virtual screening or as an affinity predictor for novel molecules after a complex is generated. Pafnucy can also be utilized directly during the docking procedure to guide ligand pose optimization. The model was tested on the CASF-2013 'scoring power' benchmark and outperformed all 20 state-of-the-art scoring functions tested by the CASF-2013 authors. The results obtained and the careful analysis of the network show that Pafnucy makes reliable predictions based on relevant features.

Predicting the impact of small molecules on diverse biologically important protein targets has long been sought by researchers. Pafnucy can be applied either to test multiple compounds against a single protein or to test multiple proteins against a single compound. It can therefore help in discovering new potential drugs, but also in investigating side effects of bioactive molecules. By anticipating the potential impact of new drugs on the biology of the cell, Pafnucy may contribute to such disciplines as systems medicine and systems biology.

The approaches used in the analyses presented in the 'Discussion' are general and can be applied to other predictive models for drug discovery. Finding the most important features and parts of a molecular complex can help researchers to design better compounds, but also to better understand and improve their models.

Because Pafnucy is a neural network, it is also possible to calculate and analyze its gradients. During training, gradients are used to optimize the model parameters. However, they can also be calculated for the input and point to beneficial changes in a molecule's conformation (finding the optimal pose during docking) or composition (lead optimization).
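Such an input gradient can be obtained in a few lines; the sketch below uses TensorFlow 2's GradientTape (the published code targets the TensorFlow 1.x API, where tf.gradients would play the same role):

    import tensorflow as tf

    def input_gradient(model, grid):
        """Gradient of the predicted affinity with respect to the input grid;
        large-magnitude entries point to atoms/features driving the prediction."""
        x = tf.convert_to_tensor(grid[None], dtype=tf.float32)
        with tf.GradientTape() as tape:
            tape.watch(x)
            pred = model(x)
        return tape.gradient(pred, x)[0].numpy()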
Pafnucy and its source code, together with the Jupyter Notebooks used to prepare the data and analyze the results, are freely available at http://gitlab.com/cheminfIBB/pafnucy. Usage examples and scripts are also available to facilitate the most common use-cases: preparing the input data, predicting binding affinity and training a new network. We hope that these features will make Pafnucy easily applicable and adaptable by other researchers. In addition, we are working on a more flexible implementation of the model that will allow the user to easily manipulate the network parameters and the molecular complex representation, with minimal programming knowledge.

Preparing the environment with all needed dependencies and using the model on new data can be done with minimum effort:

    git clone https://gitlab.com/cheminfIBB/pafnucy
    cd pafnucy
    conda env create -f environment_gpu.yml
    source activate pafnucy_env
    python prepare.py -l ligand.mol2 -p pocket.mol2 -o data.hdf
    python predict.py -i data.hdf -o predictions.csv

Acknowledgements

The authors thank Maciej Wójcikowski for providing the data and X-Score results for the Astex Diverse Set and Maciej Dziubiński for his help in revising the manuscript.

Funding

This work was supported by the Polish Ministry of Science and Higher Education (POIG.02.03.00-00-003/09-00 and IP2010 037470).

References

Abadi,M. et al. (2015) TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
Alipanahi,B. et al. (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol., 33, 831–838.
Angermueller,C. et al. (2016) Deep learning for computational biology. Mol. Syst. Biol., 12, 878.
Bajusz,D. et al. (2017) Structure-based virtual screening approaches in kinase-directed drug discovery. Curr. Top. Med. Chem., 17, 2235–2259.
Ballester,P.J. and Mitchell,J.B. (2010) A machine learning approach to predicting protein–ligand binding affinity with applications to molecular docking. Bioinformatics, 26, 1169–1175.
Breiman,L. et al. (1984) Classification and Regression Trees. CRC Press, Boca Raton, FL, USA.
Cheng,T. et al. (2012) Structure-based virtual screening for drug discovery: a problem-centric review. AAPS J., 14, 133–141.
Dahl,G.E. et al. (2014) Multi-task neural networks for QSAR predictions. arXiv preprint arXiv:1406.1231v1, 1–21.
Durrant,J.D. and McCammon,J.A. (2010) NNScore: a neural-network-based scoring function for the characterization of protein–ligand complexes. J. Chem. Inf. Model., 50, 1865–1871.
Durrant,J.D. and McCammon,J.A. (2011) NNScore 2.0: a neural-network receptor–ligand scoring function. J. Chem. Inf. Model., 51, 2897–2903.
Duvenaud,D.K. et al. (2015) Convolutional networks on graphs for learning molecular fingerprints. In: Cortes,C. (ed.) Neural Information Processing Systems, pp. 2215–2223.
Ertl,P. et al. (2017) In silico generation of novel, drug-like chemical matter using the LSTM neural network. arXiv preprint arXiv:1712.07449.
Fradera,X. and Babaoglu,K. (2017) Overview of methods and strategies for conducting virtual small molecule screening. Curr. Protoc. Chem. Biol., 9, 196–212.
Gomes,J. et al. (2017) Atomic convolutional networks for predicting protein–ligand binding affinity. arXiv preprint arXiv:1703.10603.
Gómez-Bombarelli,R. et al. (2017) Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci., 4, 268–276.
Goodfellow,I. et al. (2016) Deep Learning. MIT Press, Cambridge, MA.
Hartshorn,M.J. et al. (2007) Diverse, high-quality test set for the validation of protein–ligand docking performance. J. Med. Chem., 50, 726–741.
Jastrzębski,S. et al. (2016) Learning to SMILE(S). In: International Conference on Learning Representations 2016 (Workshop track).
Jiménez,J. et al. (2017) DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics, 33, 3036–3042.
Jurtz,V.I. et al. (2017) An introduction to deep learning on biological sequence data: examples and solutions. Bioinformatics, 33, 3685–3690.
Kearnes,S. et al. (2016) Molecular graph convolutions: moving beyond fingerprints. J. Comput. Aided Mol. Des., 30, 595–608.
Kinnings,S.L. et al. (2011) A machine learning-based method to improve docking scoring functions and its application to drug repurposing. J. Chem. Inf. Model., 51, 408–419.
Lenselink,E.B. et al. (2017) Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J. Cheminf., 9, 45.
Leung,M.K.K. et al. (2014) Deep learning of the tissue-regulated splicing code. Bioinformatics, 30, i121–i129.
Li,Y. et al. (2014) Comparative assessment of scoring functions on an updated benchmark: 2. Evaluation methods and general results. J. Chem. Inf. Model., 54, 1717–1736.
Lima,A.N. et al. (2016) Use of machine learning approaches for novel drug discovery. Exp. Opin. Drug Discov., 11, 225–239.
Liu,Z. et al. (2017) Forging the basis for developing protein–ligand interaction scoring functions. Acc. Chem. Res., 50, 302–309.
Lusci,A. et al. (2013) Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. J. Chem. Inf. Model., 53, 1563–1575.
Ma,D.-L. et al. (2013) Drug repositioning by structure-based virtual screening. Chem. Soc. Rev., 42, 2130–2141.
Ma,J. et al. (2015) Deep neural nets as a method for quantitative structure–activity relationships. J. Chem. Inf. Model., 55, 263–274.
MacMullen,C.M. et al. (2017) Novel PDE10A transcript diversity in the human striatum: insights into gene complexity, conservation and regulation. Gene, 606, 17–24.
Morris,G.M. et al. (2009) AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem., 30, 2785–2791.
Muegge,I. (2006) PMF scoring revisited. J. Med. Chem., 49, 5895–5902.
Nketia,T.A. et al. (2017) Analysis of live cell images: methods, tools and opportunities. Methods, 115, 65–79.
O'Boyle,N.M. et al. (2011) Open Babel: an open chemical toolbox. J. Cheminf., 3, 33.
Olivecrona,M. et al. (2017) Molecular de-novo design through deep reinforcement learning. J. Cheminform., 9, 48.
Park,Y. and Kellis,M. (2015) Deep learning for regulatory genomics. Nat. Biotechnol., 33, 825–826.
Pettersen,E.F. et al. (2004) UCSF Chimera—a visualization system for exploratory research and analysis. J. Comput. Chem., 25, 1605–1612.
Ragoza,M. et al. (2016) Protein–ligand scoring with convolutional neural networks. J. Chem. Inf. Model., 57, 942–957.
Ramsundar,B. et al. (2017) Is multitask deep learning practical for pharma? J. Chem. Inf. Model., 57, 2068–2076.
Segler,M.H. et al. (2017) Generating focussed molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci., 4, 120–131.
Stepniewska-Dziubinska,M.M. et al. (2017) DeCAF—discrimination, comparison, alignment tool for 2D pharmacophores. Molecules, 22, 1128.
Stierand,K. and Rarey,M. (2010) Drawing the PDB: protein–ligand complexes in two dimensions. ACS Med. Chem. Lett., 1, 540–545.
Verdonk,M.L. et al. (2003) Improved protein–ligand docking using GOLD. Proteins Struct. Funct. Bioinf., 52, 609–623.
Wallach,I. et al. (2015) AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855.
Wójcikowski,M. et al. (2015) Open Drug Discovery Toolkit (ODDT): a new open-source player in the drug discovery field. J. Cheminf., 7, 26.
Wójcikowski,M. et al. (2017) Performance of machine-learning scoring functions in structure-based virtual screening. Sci. Rep., 7, 46710.
Xu,Y. et al. (2015) Deep learning for drug-induced liver injury. J. Chem. Inf. Model., 55, 2085–2093.
Xu,Y. et al. (2017) Demystifying multi-task deep neural networks for quantitative structure–activity relationships. J. Chem. Inf. Model., 57, 2490–2504.
Zhang,L. et al. (2017) From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug Discov. Today, 22, 1680–1685.
http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

Development and evaluation of a deep learning model for protein–ligand binding affinity prediction

Loading next page...
 
/lp/ou_press/development-and-evaluation-of-a-deep-learning-model-for-protein-ligand-8dZa30argG

References (52)

Publisher
Oxford University Press
Copyright
© The Author(s) 2018. Published by Oxford University Press.
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bty374
Publisher site
See Article on Publisher Site

Abstract

Motivation: Structure based ligand discovery is one of the most successful approaches for aug- menting the drug discovery process. Currently, there is a notable shift towards machine learning (ML) methodologies to aid such procedures. Deep learning has recently gained considerable atten- tion as it allows the model to ‘learn’ to extract features that are relevant for the task at hand. Results: We have developed a novel deep neural network estimating the binding affinity of ligand– receptor complexes. The complex is represented with a 3D grid, and the model utilizes a 3D convo- lution to produce a feature map of this representation, treating the atoms of both proteins and ligands in the same manner. Our network was tested on the CASF-2013 ‘scoring power’ benchmark and Astex Diverse Set and outperformed classical scoring functions. Availability and implementation: The model, together with usage instructions and examples, is available as a git repository at http://gitlab.com/cheminfIBB/pafnucy. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction 2016; Ma et al., 2013). These methods are naturally capable of cap- Structure-based virtual screening techniques are some of the most turing non-linear and complex relationships in the available data. successful methods for augmenting the drug discovery process Rather than ‘manually’ creating rules using expert knowledge (Bajusz et al., 2017; Fradera and Babaoglu, 2017). With structure- and statistical inference, ML models use arbitrary functions with based screening, one tries to predict binding affinity (or more often, adjustable parameters that are capable of transforming the input (in a score related to it) between a target and a candidate molecule this scenario, a protein–ligand complex) to the output (a score based on a 3D structure of their complex. This allows to rank and related to protein–ligand binding affinity). Briefly, when the model prioritize molecules for further processing and subsequent testing. is presented with examples of input data paired with the desired out- Numerous scoring schemes have been developed to aid this process, come, it ‘learns’ to return predictions that are in agreement with the most of them use statistical and/or expert analysis of available pro- values provided. Typically the process of learning is incremental; by tein–ligand structures (Morris et al., 2009; Muegge, 2006; Verdonk introducing small changes to the model parameters, the prediction is et al., 2003). Currently, there is a notable shift towards scoring func- moved closer to the target value. Prime examples of ML scoring tions using machine learning (ML) methodologies, and this have functions are RF-Score (Ballester and Mitchell, 2010), which uses been highlighted by several reviews (Cheng et al., 2012; Lima et al., random forest, and NNscore (Durrant and McCammon, 2010, V The Author(s) 2018. Published by Oxford University Press. 3666 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Pafnucy 3667 2011), which uses an ensemble of shallow neural networks. These techniques to molecules encoded with SMILES strings (Jastrze ¸bski scoring functions were proven useful in virtual screening campaigns et al., 2016). 
and yielded more active compounds than their classical counterparts Application of deep models to de novo ligand design has also (Kinnings et al., 2011; Wo ´ jcikowski et al., 2017). been explored. There are several examples of models which use However, one drawback of such ML approaches is that they still autoencoders and/or recurrent neural networks that are able to pro- rely on feature engineering, i.e. they utilize expert knowledge to de- pose new molecules with desired properties (Ertl et al., 2017; fine rules that will become the basis of input data preprocessing. Go ´ mez-Bombarelli et al., 2017; Olivecrona et al., 2017; Segler et al., Hence, one can argue that they are just more sophisticated classical 2017). Using autoencoders also allows to represent molecules scoring functions with more complex rules. with short, real-valued vectors (extracted from the bottleneck layer), The ML rule of thumb says that in order to establish a good pre- which facilitates exploration of the chemical space (Go ´ mez- dictive model, the model needs a lot of data to be able to distinguish Bombarelli et al., 2017). more general trends and patterns from noise. The growing amount Although deep learning is more readily used in ligand-based of both structural data and affinity measurements has allowed regimes, there are currently a couple of interesting examples of researchers to explore deep learning. Briefly, a deep neural network structure-based neural networks. In AtomNet (Wallach et al., consists of multiple layers of non-linear transformations that extract 2015), input—molecular complex—is discretized to a 3D grid and and combine information from data to develop sophisticated rela- fed directly into a convolutional neural network. Instead of data pre- tionships between the input and the output. One of the main advan- processing, the model uses a learnable representation to recognize tages of deep learning is that it allows for the reduction of feature different groups of interacting atoms. AtomNet is a classification engineering: the model learns to extract features as a natural conse- method that yields 1 if the ligand is active and 0 otherwise. Another quence of the process of fitting the model’s parameters to the avail- similar model was created by Ragoza et al. (2016) and trained to able data. It is clear that choosing the representation of the input perform two independent classification tasks: activity and pose pre- data has a profound impact on the predictive power of a model. diction. However, with classification methods, we lose information Currently, there is a lot of effort in the field to incorporate feature about the strength of the interaction between the protein and the extraction directly into the ML model. In such an approach, a ligand. learnable molecule representation replaces classical descriptors and Since neural networks are also suitable for regression, Gomes fingerprints and becomes the first part of the model. Then, this rep- et al. (2017) created a model predicting the energy gap between a resentation is trained together with the predictive part of the model bounded protein–ligand complex and an unbounded state. In their to extract features that are useful in solving a specific task. With work, radial pooling filters with learnable mean and variance were such a design, it is therefore theoretically possible to find and quan- used to process the input. 
Such filters enabled the production of a tify relationships and/or mechanisms that have not yet been discov- summary of the atom’s environment and a representation that was ered or are unknown to the experts (Nketia et al., 2017; Zhang invariant to atom ordering and the orientation of the complex. et al., 2017). Taking into account the current findings and aforementioned Deep learning has been relatively widely used by the bioinfor- approaches, we have developed Pafnucy (pronounced ‘paphnusy’)— matics (Alipanahi et al., 2015; Jime ´ nez et al., 2017; Jurtz et al., a novel deep neural network tailored for many structure-based 2017; Leung et al., 2014; Park and Kellis, 2015) and computational approaches, including derivative prioritization and virtual screening. biology community (Angermueller et al., 2016). Several promising Similar to Ragoza et al. (2016), the input structure is represented examples of deep learning methods have also been shown for with a 3D grid, and a combination of convolutional and dense layers computer-aided drug design (CADD). In what follows, we first focus is used; however, our model tries to predict the exact binding affin- on ligand-based methods, which are in general more established, ity value. Pafnucy utilizes a more natural approach to atom descrip- and then we continue with structure-based models. tion in which both proteins and ligands have the same atom types. The simplest deep models in ligand-based design use molecular This approach serves as a regularization technique as it forces the fingerprints as feature vectors and fully-connected (dense) neural network to discover general properties of interactions between networks built on top of them. Such approaches were proven suc- proteins and ligands. Additionally, the design of Pafnucy provides cessful; they outperform other ML methods in predicting bioactivity insight into the feature importance and information extraction that (Lenselink et al., 2017) and other properties of small molecules, like is done during learning and the final prediction of binding affinity. aqueous solubility (Lusci et al., 2013) or toxicity (Xu et al., 2015). The network was implemented with TensorFlow (Abadi et al., Additionally, neural network model allows to easily create 2015) using Python API and trained on the PDBbind database (Liu multi-task classifiers or regressors, predicting, for example, activities et al., 2017). The source code, trained model and usage instructions against multiple targets at once. It has been shown that such QSAR are available as a git repository at http://gitlab.com/cheminfIBB/ models perform better than single-task networks (Dahl et al., 2014; pafnucy. Lenselink et al., 2017; Ma et al., 2015; Ramsundar et al., 2017; Xu et al., 2017), as they benefit from more training data, but also be- cause they are able to ‘share’ internal representations between tasks, 2 Materials and methods and therefore learn to recognize more general patterns in the data. As mentioned previously, neural networks allow for more flexi- 2.1 Data bility in terms of how the data are provided to the model, so that it 2.1.1 Representation of a molecular complex might learn the best representation for its purpose. One approach to Three-dimensional structures of protein–ligand complexes require achieve this is to use convolutions on the molecular graph, allowing specific transformations and encoding in order to be utilized by relevant patterns in this graph to be identified (Duvenaud et al., a neural network. 
In our approach, we cropped the complex to a 2015; Kearnes et al., 2016). Another way is to use a recurrent neural defined size of 20-A cubic box focused at the geometric center of a network on directed acyclic representations of the molecular graph ligand. We then discretized the positions of heavy atoms using a 3D (Lusci et al., 2013), or even apply natural-language processing grid with 1-A resolution (see Supplementary Fig. S1). This approach 3668 M.M.Stepniewska-Dziubinska et al. allowed for the representation of the input as a 4D tensor in which complexes) was used as an external test set, (iii) all other complexes each point is defined by Cartesian coordinates (the first 3 dimen- (remainder of the refined set and the general set, 11906 in total) sions of the tensor) and a vector of features (the last dimension). were used as the training set. In summary, the general and refined In Pafnucy, 19 features were used to describe an atom: sets were used to train the model and select the hyperparameters, while the core set was used as an external test set that was unknown 9 bits (one-hot or all null) encoding atom types: B, C, N, O, P, S, to the model during training and validation. The scheme illustrating Se, halogen and metal relationships between the subsets and dataset partitioning is avail- 1 integer (1, 2, or 3) with atom hybridization: hyb able in Supplementary Figure S2. 1 integer counting the numbers of bonds with other heavyatoms: Atomic features were calculated using Open Babel (O’Boyle heavy_valence et al., 2011), and the complexes were transformed into grids. Helper 1 integer counting the numbers of bonds with other heteroatoms: functions used to prepare the data and Jupyter Notebook with all hetero_valence preprocessing steps are available at http://gitlab.com/cheminfIBB/ 5 bits (1 if present) encoding properties defined with SMARTS pafnucy. patterns: hydrophobic, aromatic, acceptor, donor and ring As an additional external test set, we used 73 complexes from 1 float with partial charge: partialcharge the Astex Diverse Set (Hartshorn et al., 2007). This dataset, al- 1 integer (1 for ligand, -1 for protein) to distinguish between the though substantially smaller than PDBbind subsets, provides two molecules: moltype Pafnucy with structures from an independent source, and can help in detecting generalization problems related to database-specific The SMARTS patterns were defined the same way as in our pre- artefacts. Of 85 complexes in the Astex Diverse Set, we excluded vious project (Stepniewska-Dziubinska et al., 2017). The partial those without binding affinity (11 complexes) and those present in charges were scaled by the training set’s standard deviation in order the PDBbind database (a single complex, PDB ID: 1YVF, was pre- to get a distribution with a unit standard deviation, which improves sent in the general set). The remaining structures were prepared the learning. In case of collisions (multiple atoms in a single grid point), ˚ same way as the PDBbind database. This dataset was used in order which rarely occur for a 1-A grid, features from all colliding atoms to test Pafnucy on structures from a different source. were added. 2.1.2 Dataset preparation The network was trained and tested with protein–ligand complexes 2.2 Network from the PDBbind database v. 2016 (Liu et al., 2017). 
This database 2.2.1 Architecture consists of 3D structures of molecular complexes and their corre- The architecture used in Pafnucy is a deep convolutional neural net- sponding binding affinities expressed with pK (log K or log K ) work with a single output neuron for predicting the binding affinity. a d i values. PDBBind complexes were divided by Liu et al. into 3 over- The model consists of two parts: the convolutional and dense parts, lapping subsets. The general set includes all available data. From with different types of connections between layers (see Fig. 1). this set, the refined set, which comprises complexes with higher Convolution, from which the name ‘convolutional’ stems, is a math- quality, is subtracted. Finally, the complexes from the refined set are ematical operation that mixes two functions together. Most neural clustered by protein similarity, and 5 representative complexes are network libraries actually substitute the convolution operation with selected from each cluster. This fraction of the database is called the cross-correlation (Goodfellow et al., 2016), which has a more intui- core set and is designed as a high-quality benchmark for structure- tive interpretation and measures the similarity of two functions. The based CADD methods. model discovers patterns that are encoded by the filters in the convo- To properly employ PDBbind information and prevent data lutional layer and creates a feature map with spatial occurrences for leakage, we have split the data into disjoint subsets, i.e. the refined each pattern in the data. set was subtracted from the general set, and the core set was sub- Pafnucy’s input—molecular complex—is represented with a 4D tracted from the refined set so that there are no overlaps between tensor and treated like a 3D image with multiple color channels. the three subsets. Next, we have discarded all protein–protein, pro- Each position of an input (x, y and z coordinates) is described by a tein–nucleic acid, and nucleic acid–ligand complexes from these new vector of 19 properties (see Section 2.1.1), which is analogous to datasets. Finally, in order to evaluate our model with the CASF- how each pixel of an image (x and y coordinates) is described by a 2013 ‘scoring power’ benchmark (Li et al., 2014), we needed to ex- vector of intensities of three basic colors. clude all data that overlap with the 195 complexes used in CASF- First, the input is processed by a block of 3D convolutional 2013. We therefore excluded a total of 87 overlapping complexes (5 layers combined with a max pooling layer. Pafnucy uses 3 convolu- were part of the general set, and 82 were part of the refined set) tional layers with 64, 128 and 256 filters. Each layer has 5-A cubic from the training and validation sets. For the list of excluded struc- filters and is followed by a max pooling layer with a 2-A cubic tures see Supplementary Table S1. patch. The result of the last convolutional layer is flattened and used All complexes used in this study were protonated and charged as input for a block of dense (fully-connected) layers. We used 3 using UCSF Chimera (Pettersen et al., 2004) with Amber ff14SB for dense layers with 1000, 500 and 200 neurons. In order to improve standard residues and AM1-BCC for non-standard residues and generalization, dropout with drop probability of 0.5 was used for all ligands. No additional improvements nor calibration was performed dense layers. 
We also experimented with a dropout probability of 0.2 and with no dropout, and in both cases achieved worse results on the validation set.

Both the convolutional and the dense layers are composed of rectified linear units (ReLU). ReLU was chosen because it speeds up the learning process compared with other types of activations. We also experimented with Tanh units and achieved a very similar prediction accuracy, but learning was much slower.

2.2.2 Training

The initial values of the convolutional filter weights were drawn from a truncated normal distribution with 0 mean and a standard deviation of 0.001, and the corresponding biases were set to 0.1. The weights in the dense layers were initialized with a truncated normal distribution with 0 mean and a standard deviation of 1/√n, where n is the number of incoming neurons for a given layer. The corresponding biases were set to 1.0.

The Adam optimizer was used to train the network with a 10⁻⁵ learning rate and 5 examples per mini-batch (the training set contains 11 906 complexes; therefore, the last batch actually consisted of 6 complexes instead of 5). Larger batch sizes (10 and 20 examples) were also tested but resulted in worse performance. Training was carried out for 20 epochs, and the model with the lowest error on the validation set was selected (in the case of the network described in this work, it was the model obtained after 14 epochs of training).

To reduce overfitting, we used the dropout approach mentioned earlier and L2 weight decay with λ = 0.001. Using a higher value (λ = 0.01) decreased the model's capacity too much and resulted in higher training and validation errors. In addition to providing regularization, L2 allows us to investigate feature importance: if a weight differs considerably from 0, the information it transfers must be important for the model to make a prediction (see Section 4).

An important part of our approach was to develop a model that is not sensitive to the orientation of the ligand–receptor complex. Therefore, every structure was presented to the network in 24 different orientations (i.e. all possible combinations of 90° rotations of a cubic box), yielding 24 different training examples per protein–ligand complex. By using systematic rotations of the complexes during training, we anticipated that the network would learn more general rules about protein–ligand interactions and achieve better performance on new data. Indeed, in our experiments we observed a much worse performance of models trained on single orientations, regardless of the hyperparameters used to define a particular network.
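The 24 orientations form the rotation group of a cube and can be enumerated directly on the voxel grid. The helper below is our own illustration of such an augmentation step (Pafnucy's preprocessing may implement it differently): each orientation is obtained by choosing one of the 6 faces to point 'up' and then applying one of 4 spins about that axis.

import numpy as np

def cube_rotations(grid):
    """Yield the 24 axis-aligned orientations of a voxelized complex.

    grid -- 4D array (x, y, z, channels); the channel axis is untouched.
    """
    def rot_z(g, k):                              # spin about the z axis
        return np.rot90(g, k, axes=(0, 1))

    tilts = (grid,                                # identity
             np.rot90(grid, 1, axes=(1, 2)),      # 90, 180, 270 deg about x
             np.rot90(grid, 2, axes=(1, 2)),
             np.rot90(grid, 3, axes=(1, 2)),
             np.rot90(grid, 1, axes=(0, 2)),      # 90 and 270 deg about y
             np.rot90(grid, 3, axes=(0, 2)))
    for g in tilts:                               # 6 faces x 4 spins = 24
        for k in range(4):
            yield rot_z(g, k)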
3 Results

The error on the training and validation sets was monitored during learning (see Supplementary Fig. S3). Although the model was trained on 24 different rotations of each complex, the RMSE (root mean square error) was calculated for the original orientation only, in order to speed up the computations. After 14 epochs of training the model started to overfit, and the error on the validation set started to slowly yet steadily increase. The best set of weights of the network, obtained after 14 epochs of training, was saved and used as the final model.

Model performance was evaluated on all subsets of the data (see Table 1 and Fig. 2). For each complex in the dataset, the affinity was predicted and compared to the real value. Prediction error was measured with RMSE and MAE (mean absolute error). The correlation between the scores and the experimentally measured binding constants was assessed with Pearson's correlation coefficient (R) and the standard deviation in regression (SD). SD is a measure used in CASF (Li et al., 2014) and is defined as follows:

$$\mathrm{SD} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left[t_i - (a\,y_i + b)\right]^2}$$

where $t_i$ and $y_i$ are the measured and predicted affinities for the $i$-th complex, whereas $a$ and $b$ are the slope and the intercept of the regression line between the measured and predicted values, respectively.

As expected, the network achieves the lowest error on the training set (Fig. 2c), which was used to find the weights of the network. More importantly, Pafnucy also returns accurate predictions for the two test sets (Fig. 2a and b), which were unknown to the model during training and validation. The results on the CASF-2013 'scoring power' benchmark (PDBbind v. 2013 core set), although substantially worse than for the other subsets, are still better than those for any other scoring function tested by Li et al. (2014)—the best-performing X-Score had R = 0.61 and SD = 1.78, while our model achieved R = 0.70 and SD = 1.61 (see Table 2). To our knowledge, the only scoring function with better performance published so far is RF-Score v3, which achieves R = 0.74 and SD = 1.51 on CASF-2013 [results were calculated with ODDT (Wójcikowski et al., 2015)].

Table 1. Pafnucy's performance

Dataset                    RMSE    MAE    SD     R
Test (v. 2016 core set)    1.42    1.13   1.37   0.78
Validation                 1.44    1.14   1.43   0.72
Training                   1.21    0.95   1.19   0.77

Note: Prediction accuracy for each subset was evaluated using four different metrics (see main text).

Fig. 2. Predictions for the two test sets (core sets from PDBbind v. 2016 and v. 2013), the training set and the validation set

We also compared Pafnucy to X-Score on the Astex Diverse Set (Table 3). This experiment provides Pafnucy with a test set completely separate from the data provided by Liu et al. Both methods have errors comparable to those obtained on the PDBbind data. As expected, Pafnucy outperforms X-Score on the Astex Diverse Set, regardless of which measure is used. The observed correlation, however, is lower for both methods. This effect is partially due to the fact that the Astex dataset contains only 73 complexes; the correlation is therefore much more sensitive to small changes in the predictions than for bigger subsets.
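For reference, all four metrics can be computed directly from paired measured and predicted affinities. This is our own helper, not part of the Pafnucy code base; np.polyfit supplies the slope a and intercept b that enter the SD definition above:

import numpy as np

def scoring_metrics(t, y):
    """RMSE, MAE, CASF-style SD and Pearson's R for measured (t)
    and predicted (y) affinities."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    rmse = np.sqrt(np.mean((t - y) ** 2))
    mae = np.mean(np.abs(t - y))
    a, b = np.polyfit(y, t, 1)        # regression of measured on predicted
    sd = np.sqrt(np.sum((t - (a * y + b)) ** 2) / (len(t) - 1))
    r = np.corrcoef(t, y)[0, 1]
    return rmse, mae, sd, r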
Table 2. Results on the CASF-2013 'scoring power' benchmark (PDBbind v. 2013 core set)

      Pafnucy   X-Score   ChemScore (SYBYL)   ChemPLP (GOLD)   PLP1 (Discovery Studio)   G-Score (SYBYL)
SD    1.61      1.78      1.82                1.84             1.86                      1.87
R     0.70      0.61      0.59                0.58             0.57                      0.56

Note: Only the five best-performing scoring functions are presented; for the full results see Li et al. (2014).

Table 3. Prediction accuracy on the Astex Diverse Set

Method     RMSE    MAE    SD     R
Pafnucy    1.43    1.13   1.43   0.57
X-Score    1.55    1.22   1.48   0.52

4 Discussion

4.1 Stability of the results with respect to input rotation

One of the biggest challenges of this project was to properly handle the orientation of the molecular complex. This problem also occurs in image recognition—the input looks different when an object is shown from a different angle, yet it contains the same information about the underlying real object. There are two main approaches to dealing with this issue: using a representation that is invariant to the complex position (e.g. molecule-level features or internal coordinates), or creating a model that is robust to the input orientation. The model presented in this work uses the latter approach, similarly to how 2D convolutional neural networks are used in image recognition. Therefore, to generalize well, the model needed to learn to extract information from differently presented input. In order to achieve this, we augmented the dataset with systematic rotations of the input data. If Pafnucy was trained correctly, it should return similar predictions regardless of the orientation of the complex.

To test the model's stability, we selected the PDE10A protein, a cAMP/cGMP phosphodiesterase important in signal transduction and recently linked to neuropsychiatric disorders (MacMullen et al., 2017). PDE10A is complexed with 57 different ligands in the PDBbind database (41 complexes in the training set, 6 in the validation set and 10 in the test set). Each of the complexes was presented to the model in 24 different rotations, and the distribution of the returned predictions was analyzed. As anticipated, the variability of the predicted binding constants is low (see Supplementary Fig. S4). Additionally, the variability depends neither on the value of the prediction nor on the subset the molecule belongs to.

4.2 How Pafnucy sees and processes the data

Neural networks are often deemed harder to analyze and interpret than simpler models and are sometimes regarded as 'black boxes'. The worry is that a model can yield good predictions for the wrong reasons (e.g. artefacts hidden in the data) and therefore will not generalize well to new datasets. In order to trust a neural network and its predictions, one needs to ensure that the model uses information that is relevant to the task at hand. In this section, we analyze which parts of the input are the most important and have the biggest impact on the predictions.

In the case of random forests, for example, there is an established way to calculate feature importance based on the impurity decrease (Breiman et al., 1984). With neural networks there is no such consensus, as the interpretation of the model's parameters may differ considerably between networks with different architectures. In the case of Pafnucy, which was trained with L2 regularization, we can estimate feature importance by looking at the distributions of the weights associated with the convolutional filters in the first hidden layer. Their initial values were close to 0 (see Section 2.2.2 for more details). During training, the weights tend to spread and form wider ranges, as weights with higher absolute values pass more information to the deeper layers of the network. Because Pafnucy was trained with L2 regularization, only crucial weights were likely to attain such high absolute values.

Fig. 3. Range of weights for each input channel (feature). Outliers are not shown

The input was represented using 19 channels, some of which were expected to be of low relevance for the model (e.g. the boron atom type). As we can see in Figure 3, the feature with the widest range is moltype—the feature distinguishing the protein from the ligand. This result implies that Pafnucy learned that binding affinity depends on the relationship between the two molecules and that recognizing them is crucial. Additionally, the weights for the selenium and boron atom types (Se and B, respectively) barely changed during training and are close to 0. This result can be interpreted in two ways: either the network found other features of protein–ligand complexes more important for binding affinity, or, due to the infrequent occurrence of these atom types in ligands, the network was not able to find any general patterns for their influence on binding affinity.
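This kind of inspection is easy to reproduce. The sketch below assumes a Keras-style model like the earlier one (the layer index and kernel shape are our assumptions) and extracts the per-channel weight ranges of the first convolutional kernel:

import numpy as np

def channel_weight_ranges(model):
    """Spread of first-layer weights per input feature channel.

    Assumes model.layers[0] is a Conv3D layer whose kernel has shape
    (5, 5, 5, 19, 64); the 4th axis indexes the 19 input channels.
    """
    kernel = model.layers[0].get_weights()[0]
    per_channel = kernel.reshape(-1, kernel.shape[3], kernel.shape[4])
    ranges = {}
    for c in range(per_channel.shape[1]):
        w = per_channel[:, c, :].ravel()
        ranges[c] = (w.min(), w.max())   # a wide range suggests an influential feature
    return ranges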
To further inspect how the network utilizes its input, we analyzed the impact of missing data on the prediction. For this purpose, we selected one of the PDE10A complexes, with a benzimidazole inhibitor (complex PDB ID: 3WS8; ligand PDB ID: X4C). The experiment was carried out as follows: we produced 343 corrupted complexes with some missing data and predicted the binding affinity for each. The missing data were produced by deleting a 5 Å cubic box from the original data. We slid the box with a 3 Å step (in every direction), thus yielding 7³ = 343 corrupted inputs. Next, we rotated the complex by 180° about the X-axis and followed the same procedure, thus yielding another 343 corrupted inputs. Then, for each of the two orientations, we took the 10 corrupted inputs with the highest drop in predicted affinity (Fig. 4); we wanted to find the atoms whose absence caused the highest drops in the predictions.

As we can see in Figure 4a and b, for both orientations we identified the same region, containing the ligand and its nearest neighbourhood. The boxes contain the amino acids participating in the interactions with the ligand, i.e. Gln726, which forms a hydrogen bond, and Phe729, which forms a π–π interaction with the ligand (Fig. 4c). Additionally, if we consider the 15 corrupted complexes with the highest drop in predictions, we find other amino acids interacting with the ligand: Tyr693, which forms a hydrogen bond, and Met713, which forms hydrophobic contacts with the ligand. The methodology presented above can be applied to other complexes in order to elucidate the specific ligand–receptor interactions with the most profound effect on the prediction.

Fig. 4. The most important parts of the input. Regardless of the complex orientation, the same region of the input had the highest impact on the prediction. Note that the second plot is rotated back about the X-axis to ease the comparison. (a) Original orientation. (b) Rotated by 180° about the X-axis. (c) Protein–ligand interactions. Graphic was generated with PoseView (Stierand and Rarey, 2010)
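The occlusion procedure itself is straightforward to sketch. Again, this is our own illustration rather than the authors' script; the box and step sizes follow the text, boxes at the grid edge are simply clipped by slicing, and the model.predict call assumes a Keras-style model:

import numpy as np

def occlusion_scan(model, grid, box=5, step=3):
    """Predict affinity after zeroing out each 5 A cubic sub-box.

    grid -- one voxelized complex, shape (21, 21, 21, 19).
    range(0, 21, 3) gives 7 positions per axis, i.e. 7**3 = 343 inputs.
    Returns ((i, j, k), drop) pairs sorted by the drop in prediction.
    """
    baseline = float(model.predict(grid[None], verbose=0)[0, 0])
    drops = []
    n = grid.shape[0]
    for i in range(0, n, step):
        for j in range(0, n, step):
            for k in range(0, n, step):
                corrupted = grid.copy()
                corrupted[i:i + box, j:j + box, k:k + box] = 0.0
                pred = float(model.predict(corrupted[None], verbose=0)[0, 0])
                drops.append(((i, j, k), baseline - pred))
    # positions with the largest drop point to the most important atoms
    return sorted(drops, key=lambda x: -x[1])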
Going back to the uncorrupted input, we wanted to investigate how Pafnucy managed to give almost identical predictions for two different orientations of the complex (the second rotated by 180° about the X-axis). For this inquiry, we analyzed the activations of the hidden layers for the two inputs.

Fig. 5. Activations on the hidden layers for two orientations of the PDE10A complex (PDB ID: 3WS8). Darker colors indicate higher values. Cosine distances (d) between the activation patterns are provided for each layer

In Figure 5, we can see that the first hidden layer has very different activation patterns for the two orientations of the input: Pafnucy gets very differently presented data and needs to use different filters in the first convolutional layer to process them. However, the closer we get to the output layer, the more similar the activations become. We can clearly see that our model learned to extract the same information from differently presented data.

5 Conclusions

In this work, we presented a deep neural network, Pafnucy, which can be used in structure-based ligand discovery campaigns as a scoring function in virtual screening, or as an affinity predictor for novel molecules after a complex has been generated. Pafnucy can also be utilized directly during the docking procedure to guide ligand pose optimization. The model was tested on the CASF-2013 'scoring power' benchmark and outperformed all 20 state-of-the-art scoring functions tested by the CASF-2013 authors. The results obtained and the careful analysis of the network show that Pafnucy makes reliable predictions based on relevant features.

Predicting the impact of small molecules on diverse biologically important protein targets has long been sought by researchers. Pafnucy can be applied either to test multiple compounds against a single protein, or to test multiple proteins against a single compound. It can therefore help in discovering new potential drugs, but also in investigating the side effects of bioactive molecules. By anticipating the potential impact of new drugs on the biology of the cell, Pafnucy may contribute to such disciplines as systems medicine and systems biology.

The approaches used in the analyses presented in the 'Discussion' are general and can be applied to other predictive models for drug discovery. Finding the most important features and parts of a molecular complex can help researchers to design better compounds, but also to better understand and improve their models. Because Pafnucy is a neural network, it is also possible to calculate and analyze its gradients. During training, gradients are used to optimize the model parameters; however, they can also be calculated for the input, where they point to beneficial changes in a molecule's conformation (finding the optimal pose during docking) or composition (lead optimization).

Pafnucy and its source code, together with the Jupyter Notebooks used to prepare the data and analyze the results, are freely available at http://gitlab.com/cheminfIBB/pafnucy.
Usage examples and scripts are also available to facilitate the most common use-cases: preparing the input data, predicting binding affinity and training a new network. We hope that these features will make Pafnucy easily applicable and adaptable by other researchers. In addition, we are working on a more flexible implementation of the model that will allow the user to easily manipulate the network parameters and the molecular complex representation, with minimal programming knowledge.

Preparing the environment with all needed dependencies and using the model on new data can be done with minimum effort:

git clone https://gitlab.com/cheminfIBB/pafnucy
cd pafnucy
conda env create -f environment_gpu.yml
source activate pafnucy_env
python prepare.py -l ligand.mol2 -p pocket.mol2 -o data.hdf
python predict.py -i data.hdf -o predictions.csv

Acknowledgements

The authors thank Maciej Wójcikowski for providing the data and X-Score results for the Astex Diverse Set, and Maciej Dziubiński for his help in revising the manuscript.

Funding

This work was supported by the Polish Ministry of Science and Higher Education (POIG.02.03.00-00-003/09-00 and IP2010 037470).
References

Abadi,M. et al. (2015) TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
Alipanahi,B. et al. (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol., 33, 831–838.
Angermueller,C. et al. (2016) Deep learning for computational biology. Mol. Syst. Biol., 12, 878.
Bajusz,D. et al. (2017) Structure-based virtual screening approaches in kinase-directed drug discovery. Curr. Top. Med. Chem., 17, 2235–2259.
Ballester,P.J. and Mitchell,J.B. (2010) A machine learning approach to predicting protein–ligand binding affinity with applications to molecular docking. Bioinformatics, 26, 1169–1175.
Breiman,L. et al. (1984) Classification and Regression Trees. CRC Press, Boca Raton, FL, USA.
Cheng,T. et al. (2012) Structure-based virtual screening for drug discovery: a problem-centric review. AAPS J., 14, 133–141.
Dahl,G.E. et al. (2014) Multi-task neural networks for QSAR predictions. arXiv preprint arXiv:1406.1231v1, 1–21.
Durrant,J.D. and McCammon,J.A. (2010) NNScore: a neural-network-based scoring function for the characterization of protein–ligand complexes. J. Chem. Inf. Model., 50, 1865–1871.
Durrant,J.D. and McCammon,J.A. (2011) NNScore 2.0: a neural-network receptor–ligand scoring function. J. Chem. Inf. Model., 51, 2897–2903.
Duvenaud,D.K. et al. (2015) Convolutional networks on graphs for learning molecular fingerprints. In: Cortes,C. (ed.) Neural Information Processing Systems, pp. 2215–2223.
Ertl,P. et al. (2017) In silico generation of novel, drug-like chemical matter using the LSTM neural network. arXiv preprint arXiv:1712.07449.
Fradera,X. and Babaoglu,K. (2017) Overview of methods and strategies for conducting virtual small molecule screening. Curr. Protoc. Chem. Biol., 9, 196–212.
Gomes,J. et al. (2017) Atomic convolutional networks for predicting protein–ligand binding affinity. arXiv preprint arXiv:1703.10603.
Gómez-Bombarelli,R. et al. (2017) Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci., 4, 268–276.
Goodfellow,I. et al. (2016) Deep Learning. MIT Press, Cambridge, MA.
Hartshorn,M.J. et al. (2007) Diverse, high-quality test set for the validation of protein–ligand docking performance. J. Med. Chem., 50, 726–741.
Jastrzębski,S. et al. (2016) Learning to SMILE(S). In: International Conference on Learning Representations 2016 (Workshop track).
Jiménez,J. et al. (2017) DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics, 33, 3036–3042.
Jurtz,V.I. et al. (2017) An introduction to deep learning on biological sequence data: examples and solutions. Bioinformatics, 33, 3685–3690.
Kearnes,S. et al. (2016) Molecular graph convolutions: moving beyond fingerprints. J. Comput. Aided Mol. Des., 30, 595–608.
Kinnings,S.L. et al. (2011) A machine learning-based method to improve docking scoring functions and its application to drug repurposing. J. Chem. Inf. Model., 51, 408–419.
Lenselink,E.B. et al. (2017) Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J. Cheminf., 9, 45.
Leung,M.K.K. et al. (2014) Deep learning of the tissue-regulated splicing code. Bioinformatics, 30, i121–i129.
Li,Y. et al. (2014) Comparative assessment of scoring functions on an updated benchmark: 2. Evaluation methods and general results. J. Chem. Inf. Model., 54, 1717–1736.
Lima,A.N. et al. (2016) Use of machine learning approaches for novel drug discovery. Exp. Opin. Drug Discov., 11, 225–239.
Liu,Z. et al. (2017) Forging the basis for developing protein–ligand interaction scoring functions. Accounts Chem. Res., 50, 302–309.
Lusci,A. et al. (2013) Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. J. Chem. Inf. Model., 53, 1563–1575.
Ma,D.-L. et al. (2013) Drug repositioning by structure-based virtual screening. Chem. Soc. Rev., 42, 2130–2141.
Ma,J. et al. (2015) Deep neural nets as a method for quantitative structure–activity relationships. J. Chem. Inf. Model., 55, 263–274.
MacMullen,C.M. et al. (2017) Novel PDE10A transcript diversity in the human striatum: insights into gene complexity, conservation and regulation. Gene, 606, 17–24.
Morris,G.M. et al. (2009) AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem., 30, 2785–2791.
Muegge,I. (2006) PMF scoring revisited. J. Med. Chem., 49, 5895–5902.
Nketia,T.A. et al. (2017) Analysis of live cell images: methods, tools and opportunities. Methods, 115, 65–79.
O'Boyle,N.M. et al. (2011) Open Babel: an open chemical toolbox. J. Cheminf., 3, 33.
Olivecrona,M. et al. (2017) Molecular de-novo design through deep reinforcement learning. J. Cheminf., 9, 48.
Park,Y. and Kellis,M. (2015) Deep learning for regulatory genomics. Nat. Biotechnol., 33, 825–826.
Pettersen,E.F. et al. (2004) UCSF Chimera—a visualization system for exploratory research and analysis. J. Comput. Chem., 25, 1605–1612.
Ragoza,M. et al. (2016) Protein–ligand scoring with convolutional neural networks. J. Chem. Inf. Model., 57, 942–957.
Ramsundar,B. et al. (2017) Is multitask deep learning practical for pharma? J. Chem. Inf. Model., 57, 2068–2076.
Segler,M.H. et al. (2017) Generating focussed molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci., 4, 120–131.
Stepniewska-Dziubinska,M.M. et al. (2017) DeCAF—discrimination, comparison, alignment tool for 2D pharmacophores. Molecules, 22, 1128.
Stierand,K. and Rarey,M. (2010) Drawing the PDB: protein–ligand complexes in two dimensions. ACS Med. Chem. Lett., 1, 540–545.
Verdonk,M.L. et al. (2003) Improved protein–ligand docking using GOLD. Proteins Struct. Funct. Bioinf., 52, 609–623.
Wallach,I. et al. (2015) AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855.
Wójcikowski,M. et al. (2015) Open Drug Discovery Toolkit (ODDT): a new open-source player in the drug discovery field. J. Cheminf., 7, 26.
Wójcikowski,M. et al. (2017) Performance of machine-learning scoring functions in structure-based virtual screening. Sci. Rep., 7, 46710.
Xu,Y. et al. (2015) Deep learning for drug-induced liver injury. J. Chem. Inf. Model., 55, 2085–2093.
Xu,Y. et al. (2017) Demystifying multi-task deep neural networks for quantitative structure–activity relationships. J. Chem. Inf. Model., 57, 2490–2504.
Zhang,L. et al. (2017) From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug Discov. Today, 22, 1680–1685.
