Predicting Drug-Induced Liver Injury Using Ensemble Learning Methods and Molecular Fingerprints

Abstract

Drug-induced liver injury (DILI) is a major safety concern in the drug-development process, and various methods have been proposed to predict the hepatotoxicity of compounds during the early stages of drug trials. In this study, we developed an ensemble model using 3 machine learning algorithms and 12 molecular fingerprints from a dataset containing 1241 diverse compounds. The ensemble model achieved an average accuracy of 71.1 ± 2.6%, sensitivity (SE) of 79.9 ± 3.6%, specificity (SP) of 60.3 ± 4.8%, and area under the receiver-operating characteristic curve (AUC) of 0.764 ± 0.026 in 5-fold cross-validation, and an accuracy of 84.3%, SE of 86.9%, SP of 75.4%, and AUC of 0.904 on an external validation dataset of 286 compounds collected from the Liver Toxicity Knowledge Base. Compared with previous methods, the ensemble model achieved relatively high accuracy and SE. We also identified several substructures related to DILI. In addition, we provide a web server offering access to our models (http://ccsipb.lnu.edu.cn/toxicity/HepatoPred-EL/).

Keywords: DILI, hepatotoxicity, molecular fingerprints, machine learning, ensemble

The liver is a vital organ that plays a central role in metabolism. Drug-induced liver injury (DILI) is one of the leading causes of drug failure in trials and of withdrawal from the market (Björnsson et al., 2006; Segall and Barber, 2014). Thus, determining the hepatotoxicity of compounds is essential. Over the past decades, various approaches have been developed to assess the risk of DILI, both in vivo and in vitro (Jennen et al., 2014; Tomida et al., 2015). However, these studies are expensive and time-consuming, and may not yield high correlations between experimental results and effects observed in humans.
Therefore, computational approaches that predict the hepatotoxicity of compounds from chemical structure properties are gradually being recognized as effective tools. In silico approaches are a low-cost, fast way to collect information on potential toxicity, and great efforts have been devoted to hepatotoxicity prediction in recent years (Ekins, 2014; Przybylak and Cronin, 2012). In particular, quantitative structure-activity relationship (QSAR) models (Ai et al., 2014, 2010; Dearden, 2016; Golbraikh and Tropsha, 2002; Muster et al., 2008; Zhu and Kruhlak, 2014), which aim to explain the relationships between chemical structure features (eg, molecular descriptors) and biologic activities (such as hepatotoxicity, mutagenicity, and carcinogenicity) using various statistical algorithms trained on known activity datasets, can predict the potential hepatotoxicity of a new compound before its synthesis. Such models can therefore be used in the early stages of drug development to filter out compounds with a potential risk of hepatotoxicity before in vivo and in vitro research (Cheng et al., 2013; Merlot, 2010). Various QSAR models for predicting hepatotoxicity have been reported, most of which use machine learning methods (Chen et al., 2013b; Ekins et al., 2010; Liew et al., 2011; Zhang et al., 2016a,b). Ekins et al. (2010) developed a Bayesian model with extended connectivity fingerprints and other interpretable descriptors based on a training set of 295 compounds and applied it to 237 compounds for external validation, achieving an accuracy of 57%–59% on the training set and 60% on the external test. Liew et al. (2011) presented an ensemble of 617 base classifiers using support vector machine (SVM) and k-nearest neighbor (k-NN) methods based on a diverse set of 1087 compounds, achieving an overall accuracy of 63.8% in 5-fold cross-validation and 75.0% in an external validation of 120 compounds. Zhang et al.
(2016a) built models using 5 machine learning methods based on MACCS and FP4 fingerprints after evaluation by substructure pattern recognition, and reported that the best model, which used SVM together with the FP4 fingerprint at an information gain (IG) threshold of 0.0005, achieved an overall accuracy of 79.7% on the training set and 64.5% on the test set. Although these models are widely applicable, their accuracy in forecasting the hepatotoxicity of new compounds (estimated by cross-validation or external testing) remains unsatisfactory. Moreover, many models have not been evaluated by an appropriate cross-validation. Molecular fingerprints have been used in drug development and toxicity prediction (Marzorati et al., 2008), including in some of the models above. In this study, we used 12 types of molecular fingerprints and 3 machine learning methods to predict the hepatotoxicity of diverse organic compounds. Furthermore, we built several ensemble models that combine the base models generated by the various fingerprints and machine learning methods. To demonstrate their predictive capability and reliability, the models were evaluated by 5-fold cross-validation with 100 repeats and by external validation.

MATERIALS AND METHODS

Data Preparation

We derived the training set used to build the model from 2 papers. Liew et al. (2011) used 1274 compounds for the analysis and modeling process, most of which were included in the U.S. Food and Drug Administration (FDA) Orange Book of approved drug products with therapeutic equivalence evaluations. The authors took a highly conservative approach to labeling: any drug with the potential to cause any adverse liver effect was flagged as "positive," and all others were labeled as "negative." Zhu and Kruhlak (2014) used a calibration set of 282 compounds included in United States and European toxicity registries.
The authors used a classification approach similar to Liew's, based on whether alerting publications or warnings related to liver injury were found, and labeled the 282 compounds as "positive" or "negative" accordingly. To obtain a robust training set, we merged the 2 datasets, deleting duplicates and compounds containing fewer than 3 carbon atoms, whose structures are oversimplified. In addition, we removed from the training set any drugs that appeared in the external validation dataset. Finally, we obtained a training dataset of 1241 compounds comprising 683 positives and 558 negatives. Details of the training dataset are provided in Supplementary Table 1. To further evaluate the predictive performance of our models, we used the FDA's Liver Toxicity Knowledge Base (LTKB) (Chen et al., 2011, 2013a) as the external validation dataset; this benchmark dataset contains 287 prescription drugs, including 137 drugs with known hepatotoxicity, 85 drugs that may cause DILI, and 65 drugs with no DILI indications on their labels. We removed gemtuzumab because no simplified molecular-input line-entry system (SMILES) string, which we used to calculate the molecular fingerprints, is available for it in the LTKB benchmark dataset. Thus, we built an external validation dataset of 286 compounds containing 221 DILI positives and 65 DILI negatives. Details of the external validation dataset are provided in Supplementary Table 2.

Calculation of Molecular Fingerprints

Twelve types of molecular fingerprints were generated to represent the chemical structure features of the compounds. The names and dimensions of these fingerprints are summarized in Table 1. All fingerprints were calculated with the PaDEL-Descriptor software (version 2.21) from the SMILES strings of the compounds (Yap, 2011). Each bit of these molecular fingerprints was used as a feature in the machine learning process. Table 1.
Summary of the 12 Types of Molecular Fingerprints Generated in the Study

Fingerprint Type     Abbreviation  Pattern Type               Size (bits)  Selected (bits)
CDK                  CDK           Hash fingerprints          1024         1019
CDK Extended         CDKExt        Hash fingerprints          1024         1002
CDK Graph            CDKGraph      Hash fingerprints          1024         597
Estate               Estate        Structural features        79           22
MACCS                MACCS         Structural features        166          114
Pubchem              Pubchem       Structural features        881          221
Substructure         FP4           Structural features        307          41
Substructure Count   FP4C          Structural features count  307          38
Klekota-Roth         KR            Structural features        4860         228
Klekota-Roth Count   KRC           Structural features count  4860         161
2D Atom Pairs        AP2D          Structural features        780          97
2D Atom Pairs Count  AP2DC         Structural features count  780          61

Feature Selection

Feature selection is an essential step in data processing: removing redundant features can improve prediction performance, yield faster and more cost-effective predictors, and provide a better understanding of the underlying process that generated the data. In this study, we used the nearZeroVar function from the R package caret (version 6.0-71) (Kuhn et al., 2008) to filter out features with little variance.
Quite a few features had a single value for all compounds in the training set; for example, the features indicating whether a compound contains carbon atoms or heavy metals are constant here, because every compound in this study contains more than 3 carbon atoms and none contains heavy metals. Such features carry no information for classifying hepatotoxic compounds. We then used the findCorrelation function to filter out highly correlated features (Pearson's correlation coefficient > 0.9), because many fingerprint bits are inherently correlated; for example, a compound that contains carbon atoms very likely also contains C–H bonds, so the corresponding 2 bits are nearly redundant. The remaining features (bits) for each molecular fingerprint are summarized in Table 1.

Model Building

The SVM, random forest (RF), and extreme gradient boosting (XGBoost) algorithms were executed in R (version 3.3.1) using the kernlab (version 0.9-25) (Karatzoglou et al., 2004), randomForest (version 4.6-12) (Liaw and Wiener, 2001), and xgboost (version 0.4-4) (Chen and Guestrin, 2016) packages, respectively. A description of the basic theory of each algorithm and of how they were used is provided in our previous study (Zhang et al., 2017). Briefly, these algorithms are machine learning methods that can accommodate many features efficiently.

Support vector machine

An SVM is an efficient machine learning method based on statistical learning theory that can be used for classification and regression analysis. The algorithm maps the input features into a much higher-dimensional space through a kernel function and constructs a hyperplane, or set of hyperplanes, in that space to separate positive from negative samples. In this study, we used the radial basis kernel function to build the SVM models.
The regularization parameter C and the kernel width parameter gamma were optimized by random search (Bergstra and Bengio, 2012), as implemented in the caret package.

Random forest

Random forest is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (or the mean prediction) of the individual trees. Compared with a single decision tree, RF is more robust and can handle large amounts of data without overfitting (Hastie et al., 2001). The number of trees in the forest (ntree) and the number of features randomly sampled at each split (mtry) are the most important parameters; we set ntree to 500 and mtry to the square root of the number of features in the dataset. In addition, we used the importance function in the randomForest package to assess the feature importance for each type of molecular fingerprint.

Extreme gradient boosting

XGBoost is an efficient and scalable implementation of the gradient boosting algorithm. It applies clever penalization to the individual trees, which are consequently allowed to have varying numbers of terminal nodes (Nielsen, 2016; Sheridan et al., 2016). XGBoost achieves high accuracy while requiring less computing time. The model has 4 important parameters, namely eta (the step-size shrinkage), max.depth (the maximum tree depth), min.child.weight (the minimum sum of instance weights), and nrounds (the maximum number of iterations), which were optimized with the caret package.

Ensemble learning methods fuse multiple base models, via voting or averaging, to obtain better predictive performance than any of the constituent learning algorithms alone (Rokach, 2010). In particular, ensemble methods tend to produce much better results when there is significant diversity among the individual models.
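The fusion-by-averaging scheme can be illustrated with a minimal Python sketch (the original study used R; the function name and toy probabilities below are illustrative only):

```python
import numpy as np

def ensemble_predict(prob_matrix, threshold=0.5):
    """Fuse base classifiers by averaging their predicted probabilities.

    prob_matrix: shape (n_classifiers, n_compounds); each row holds one base
    classifier's predicted probability of hepatotoxicity for every compound.
    Returns the averaged probabilities and binary labels
    (1 = hepatotoxicant, 0 = nonhepatotoxicant).
    """
    prob_matrix = np.asarray(prob_matrix, dtype=float)
    avg_prob = prob_matrix.mean(axis=0)           # average over classifiers
    labels = (avg_prob >= threshold).astype(int)  # apply the decision cutoff
    return avg_prob, labels

# Toy example: 3 base classifiers scoring 4 compounds
probs = [[0.9, 0.2, 0.6, 0.4],
         [0.8, 0.3, 0.5, 0.2],
         [0.7, 0.1, 0.7, 0.3]]
avg, labels = ensemble_predict(probs)  # avg → [0.8, 0.2, 0.6, 0.3]
```

Averaging real-valued probabilities rather than hard votes preserves each classifier's confidence, which is what allows the fused model to keep a well-calibrated probability output.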
Recently, these methods have been widely used in many fields, including toxicity prediction. Thirty-six base classifiers were built using the 3 machine learning algorithms and 12 molecular fingerprints. From these base classifiers, 35 ensemble models were constructed: the n best-performing base classifiers (n = 2, 3, …, 36) were fused by averaging their predicted probabilities. Figure 1 displays a flowchart of the ensemble model-building process.

Figure 1. Flowchart of the ensemble model-building process.

Performance Evaluation

The predictive performance of the models was evaluated by 5-fold cross-validation with 100 repeats and by external validation. In 5-fold cross-validation, the original sample is randomly divided into 5 equal subsamples; 4 subsamples are used as training data, and the remaining one is used as validation data for testing the model. The cross-validation process is then repeated 5 times, with each of the 5 subsamples used exactly once as the validation data (Arlot and Celisse, 2010). The whole procedure was repeated 100 times to reduce the randomness of the results and provide a robust performance estimate. The following 4 indicators were used to assess the predictive performance of the models: accuracy (Q), the overall prediction accuracy for hepatotoxicants and nonhepatotoxicants; sensitivity (SE), the prediction accuracy for hepatotoxicants; specificity (SP), the prediction accuracy for nonhepatotoxicants; and the area under the receiver-operating characteristic curve (AUC).
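The repeated 5-fold splitting scheme can be sketched as follows (a minimal Python illustration, not the authors' R code; the function name and seed are assumptions for the example):

```python
import numpy as np

def repeated_kfold_indices(n_samples, k=5, repeats=100, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation.

    Each repeat reshuffles the sample order and splits it into k near-equal
    folds; every sample appears in the test fold exactly once per repeat.
    """
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        order = rng.permutation(n_samples)
        folds = np.array_split(order, k)
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            yield train_idx, test_idx

# Sanity check on a tiny dataset: 2 repeats of 5-fold CV over 10 samples
splits = list(repeated_kfold_indices(10, k=5, repeats=2, seed=42))
```

Because each repeat uses a fresh random shuffle, averaging the indicators over all 100 repeats dampens the variance that a single 5-fold split would exhibit.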
These indicators were calculated as follows:

Q = (TP + TN)/(TP + TN + FP + FN) × 100%, (1)

SE = TP/(TP + FN) × 100%, (2)

SP = TN/(TN + FP) × 100%, (3)

where true positives (TP) is the number of hepatotoxicants that are correctly predicted, true negatives (TN) is the number of nonhepatotoxicants that are correctly predicted, false positives (FP) is the number of nonhepatotoxicants that are wrongly predicted as hepatotoxicants, and false negatives (FN) is the number of hepatotoxicants that are wrongly predicted as nonhepatotoxicants. The receiver-operating characteristic (ROC) curve is a plot of the TP rate (SE) against the FP rate (1 − SP) for the different possible cutoff points of a diagnostic test. The AUC was calculated as an important indicator of the predictive ability of a model.

RESULTS

Twelve types of molecular fingerprints and 3 machine learning methods were used to predict the hepatotoxicity of diverse organic compounds. Thirty-six base classifiers were generated, and their performance was evaluated by 5-fold cross-validation. The results, presented in Table 2, indicate that the SVM algorithm achieves slightly higher SE while yielding relatively lower SP than the other algorithms. In the training set, the accuracy (Q) ranged from 62.7% to 70.2%, the AUC ranged from 0.674 to 0.748, the SE ranged from 66.2% to 81.6%, and the SP ranged from 40.7% to 62.6%. The most accurate classifier was generated by the SVM method using CDKExt fingerprints (Q = 70.2%), and the highest AUC was produced by the RF method using FP4C fingerprints (AUC = 0.748). We conclude that different algorithms and fingerprints have different advantages and disadvantages in predicting hepatotoxicity. Table 2.
Performance of the Base Classifiers in 5-fold Cross-validation

Algorithm  Fingerprint  Q (%)     SE (%)    SP (%)    AUC
SVM        CDK          69.6±2.6  78.5±3.8  58.6±4.5  0.744±0.028
SVM        CDKExt       70.2±2.6  79.3±3.8  59.1±4.7  0.747±0.027
SVM        CDKGraph     66.1±2.7  75.5±4.0  54.6±4.9  0.717±0.029
SVM        Estate       63.9±2.6  75.9±4.1  49.1±4.9  0.674±0.030
SVM        MACCS        68.7±2.6  77.3±4.0  58.3±4.8  0.733±0.027
SVM        Pubchem      69.4±2.6  77.5±3.9  59.6±4.8  0.742±0.028
SVM        FP4          66.8±2.6  75.7±4.2  55.8±4.5  0.713±0.029
SVM        FP4C         66.2±2.6  75.9±4.3  54.3±5.0  0.702±0.029
SVM        KR           66.9±2.6  76.3±4.3  55.3±4.7  0.723±0.028
SVM        KRC          65.8±2.7  72.7±5.1  57.4±5.3  0.710±0.028
SVM        AP2D         65.8±2.7  78.2±3.9  50.6±4.5  0.699±0.029
SVM        AP2DC        63.2±2.5  81.6±3.7  40.7±4.3  0.669±0.030
RF         CDK          68.5±2.5  77.6±3.7  57.3±4.5  0.735±0.027
RF         CDKExt       68.6±2.6  78.4±3.5  56.6±4.9  0.734±0.027
RF         CDKGraph     66.7±2.7  74.4±4.1  57.2±4.7  0.717±0.029
RF         Estate       64.6±2.7  69.4±4.1  58.8±4.7  0.692±0.029
RF         MACCS        68.8±2.6  74.5±3.8  61.7±4.5  0.738±0.028
RF         Pubchem      68.4±2.8  75.2±3.7  60.1±4.5  0.738±0.027
RF         FP4          66.9±2.6  73.6±4.0  58.8±4.7  0.716±0.028
RF         FP4C         69.4±2.5  75.8±3.4  61.5±4.6  0.748±0.026
RF         KR           66.7±2.7  73.6±4.1  58.3±4.8  0.728±0.028
RF         KRC          67.8±2.8  74.3±3.9  59.8±4.7  0.739±0.026
RF         AP2D         65.8±2.7  73.0±3.9  57.1±4.5  0.710±0.028
RF         AP2DC        65.7±2.9  71.7±4.3  58.4±4.8  0.701±0.031
XGBoost    CDK          68.3±2.6  75.1±3.9  60.0±4.6  0.737±0.028
XGBoost    CDKExt       68.5±2.7  76.1±3.7  59.1±5.0  0.738±0.027
XGBoost    CDKGraph     66.1±2.6  71.6±4.2  59.4±4.6  0.720±0.027
XGBoost    Estate       62.7±2.9  66.2±4.4  58.4±4.9  0.675±0.030
XGBoost    MACCS        66.9±2.6  70.9±3.8  62.0±4.7  0.721±0.028
XGBoost    Pubchem      68.7±2.6  73.7±4.0  62.6±4.5  0.737±0.029
XGBoost    FP4          66.3±2.8  73.5±4.0  57.3±4.8  0.712±0.031
XGBoost    FP4C         68.5±2.4  73.8±3.6  62.0±4.3  0.733±0.027
XGBoost    KR           66.3±2.8  73.6±4.0  57.2±4.8  0.714±0.028
XGBoost    KRC          67.2±2.6  73.9±3.9  58.9±4.5  0.721±0.027
XGBoost    AP2D         64.9±2.8  70.9±4.1  57.6±4.6  0.694±0.030
XGBoost    AP2DC        64.3±2.7  69.4±4.3  58.0±4.6  0.683±0.029

The performance values are represented as means ± standard deviation. Abbreviations: SVM, support vector machine; RF, random forest; XGBoost, extreme gradient boosting; Q, accuracy; SE, sensitivity; SP, specificity; AUC, area under the curve.

To integrate the advantages of the various algorithms and fingerprints, we built ensemble models based on the 36 base classifiers. We first ranked the 36 classifiers by AUC, then selected the n classifiers with the highest AUC values and averaged their prediction probabilities to obtain the ensemble prediction.
This process generated 35 new ensemble models, whose performance indicators in cross-validation are presented in Supplementary Table 3. As expected, all ensemble models achieved significantly higher accuracy and AUC than any of the base classifiers. Moreover, almost all the sensitivities and specificities of the ensemble models were slightly higher than those of the single models, and almost all the variances were reduced. Owing to the diversity and independence of the combined models, the ensemble models obtained better predictive performance and more stable predictions without unbalanced classification. The numbers of most accurate base classifiers and their corresponding AUC values are shown in Figure 2. The best ensemble model consisted of 5 base classifiers: RF using FP4C, SVM using CDKExt, SVM using CDK, SVM using Pubchem, and RF using KRC. We named this the Ensemble Top-5 model; its performance indicators were 71.1 ± 2.6% accuracy, 79.9 ± 3.6% SE, 60.3 ± 4.8% SP, and 0.764 ± 0.026 AUC. Compared with the most accurate single classifier (SVM using CDKExt) and the classifier with the highest AUC (RF using FP4C), the ensemble improved accuracy by 0.9% and AUC by 0.016. This indicates that the ensemble method can improve the performance of hepatotoxicity prediction.

Figure 2. The numbers of most accurate base classifiers and their corresponding area under the curve (AUC) values.

To assess the performance of our ensemble model in external validation, we used a dataset of 286 compounds from the LTKB database.
These compounds were not included in the training set and were not involved in the model-building process; therefore, the results objectively reflect the ensemble model's ability to predict the hepatotoxicity of chemical compounds. The ensemble model yielded an accuracy of 84.3%, SE of 86.9%, SP of 75.4%, and AUC of 0.904. For comparison, we also evaluated the 36 base classifiers on the external validation set; the results are presented in Supplementary Table 4. The single models achieved high accuracy on the LTKB compounds, but our Ensemble Top-5 model achieved higher accuracy and AUC than almost all of them. Unlike cross-validation, performance metrics obtained from an external test set containing fewer compounds are subject to chance variation and are therefore less precise; even so, very few single models outperformed the ensemble model in the external validation. This shows that our Ensemble Top-5 model is not only efficient but also stable in predicting the hepatotoxicity of chemical compounds. We built a user-friendly web server, HepatoPred-EL, to provide convenient access to our models (http://ccsipb.lnu.edu.cn/toxicity/HepatoPred-EL/).

DISCUSSION

Comparison with Previous Methods

Various computational methods for hepatotoxicity prediction have been reported (Chen et al., 2013b; Ekins et al., 2010; Liew et al., 2011; Zhang et al., 2016a,b). To ensure the fairness of our analysis, we selected only QSAR methods that had been cross-validated for comparison. The main indicators of these methods are summarized in Table 3. Because the datasets and feature selections vary, a direct comparison of methods can be flawed; nevertheless, some conclusions can still be drawn: (1) The accuracy of our model is higher than that of previous methods.
However, none of these methods offers high accuracy; most of the reported accuracies are below 70%, which may be due to the complexity of hepatotoxicity. (2) Our Ensemble Top-5 model achieved relatively high SE. Sensitivity reflects the ability to correctly identify hepatotoxicants and is considered the more important indicator of a classifier's quality for hepatotoxicity prediction, because a superior ability to identify hepatotoxicants can warn drug developers of the potential adverse effects of candidate compounds, prompting further verification during drug development. On the other hand, a model with high SE but low SP also carries a high risk of discarding good drugs, so toxicity researchers and drug developers must balance the trade-off between losing good drugs and advancing toxic ones. To address this problem, our model not only predicts whether a compound is hepatotoxic but also provides the probability that the compound is a hepatotoxicant. Researchers can set appropriate thresholds based on the diseases they focus on and the goals of their research. For example, the default threshold of our model is 0.5, meaning that compounds with predicted probabilities of hepatotoxicity greater than 0.5 are classified as hepatotoxicants and the others as nonhepatotoxicants. Researchers who focus more on toxicity can reduce this threshold, and those who focus more on activity can increase it. These probabilities of hepatotoxicity are included in the prediction results of the web server we built for this model. (3) We used the LTKB database as our external dataset and achieved an accuracy of 84.3%; Zhang et al. (2016b) achieved an accuracy of 72.6% using a dataset of 420 compounds that contains the 286 LTKB compounds. By this comparison, our model most likely performs better.
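The threshold adjustment discussed in point (2) above amounts to a one-line decision rule; a minimal Python sketch (the function name and example probability are illustrative, not part of the web server):

```python
def classify(prob_hepatotoxic, threshold=0.5):
    """Label a compound from its predicted probability of hepatotoxicity.

    Lowering the threshold favors sensitivity (fewer missed hepatotoxicants);
    raising it favors specificity (fewer good drugs discarded).
    """
    return "hepatotoxicant" if prob_hepatotoxic > threshold else "nonhepatotoxicant"

# The same compound (predicted probability 0.4) under different priorities:
default = classify(0.4)                  # default cutoff of 0.5
cautious = classify(0.4, threshold=0.3)  # toxicity-focused screening
```

Here the same probability of 0.4 is labeled negative at the default cutoff but positive under the toxicity-focused cutoff, which is exactly the trade-off the text asks researchers to weigh.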
Nevertheless, DILI is a very complex toxic endpoint, and its complexity can cause inaccuracies in computational models; the meaning of a positive or negative DILI signal for a drug relies heavily on the context of the disease being treated and the intended dose and duration. Under current modeling technology, we could not fully account for this complexity when developing the models. Although we believe our research provides a useful reference for drug development, more effort is needed in this field.

Table 3. Performance Indicators of Several Hepatotoxicity Prediction Models Reported in the Literature

Model Name       No. of Compounds  Test Method      Q (%)  SE (%)  SP (%)  AUC    References
Bayesian         295               10-fold CV×100   58.5   52.8    65.5    0.62   Ekins et al. (2010)
Mixed            1087              5-fold CV        63.8   64.1    63.3    0.676  Liew et al. (2011)
Decision Forest  197               10-fold CV×2000  69.7   57.8    77.9    –      Chen et al. (2013b)
SVM              1317              Test set         66.5   92.9    24.0    0.651  Zhang et al. (2016a)
Naïve Bayesian   420               Test set         72.6   72.5    72.7    –      Zhang et al. (2016b)
Ensemble-Top5    1241              5-fold CV×100    71.1   79.9    60.3    0.764  Present study
Ensemble-Top5    1241              Test set         84.3   86.9    75.4    0.904  Present study

Abbreviations: SVM, support vector machine; Q, accuracy; SE, sensitivity; SP, specificity; AUC, area under the curve.

Substructures Related to Hepatotoxicity

To better understand the contribution of individual substructures to hepatotoxicity prediction, the mean decrease in the Gini coefficient was used to assess the importance of these substructures under the RF algorithm (Siroky, 2009). In this study, every bit in a molecular fingerprint corresponds to a chemical group, and we queried the fingerprint keys and their corresponding SMARTS patterns in the official documentation of the PaDEL-Descriptor software. Important bits can indicate the hepatotoxicity risk of chemicals and offer reference value for drug screening and medicine development, so we analyzed the important bits in the molecular fingerprints.
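For intuition, the mean decrease in the Gini coefficient accumulates, over all trees and nodes of a random forest, the impurity reduction achieved whenever a node splits on a given fingerprint bit. A minimal illustration in Python (the toy node and split are hypothetical; the study itself used the randomForest R package):

```python
def gini(labels):
    """Gini impurity of a set of binary class labels (1 = hepatotoxicant)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def gini_decrease(parent, left, right):
    """Impurity decrease achieved by splitting `parent` into `left` + `right`.

    A random forest accumulates this quantity over every node that splits on a
    given fingerprint bit; averaging over all trees gives the mean decrease in
    Gini used to rank substructures.
    """
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Toy node: 4 hepatotoxicants (1) and 4 nonhepatotoxicants (0), split on
# whether a hypothetical fingerprint bit is present.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left   = [1, 1, 1, 0]   # bit present
right  = [1, 0, 0, 0]   # bit absent
print(gini_decrease(parent, left, right))  # → 0.125
```

A bit whose presence concentrates hepatotoxicants on one side of the split yields a larger decrease and therefore a higher importance ranking.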
Because CDK, CDKExt, and CDKGraph are hash fingerprints and FP4C, KRC, and AP2DC are structural-feature count fingerprints, we calculated the mean decreases in the Gini coefficient using the Estate, MACCS, Pubchem, FP4, KR, and AP2D fingerprints. For each type of fingerprint, we selected the 10 most important features based on the mean decreases in the Gini coefficient (Figure 3). The higher the mean decrease in the Gini coefficient, the more important the feature is for hepatotoxicity prediction. The fingerprint keys and the substructures' occurrence in hepatotoxicants and nonhepatotoxicants are presented in Table 4. Almost every feature appeared in more than half of all compounds in the 1241-compound dataset, indicating that these features are significant in the prediction process. With the exception of KR-298, all these features appeared more frequently in hepatotoxicants than in nonhepatotoxicants, implying that hepatotoxicants and nonhepatotoxicants differ in structure and that hepatotoxicants may share several particular substructures. This may also explain why the SE of our model was much higher than the SP. We therefore recommend that these substructures be taken into account when designing therapeutic compounds. Table 4.
Top-Ranking Substructures and Their Occurrence in Hepatotoxicants and Nonhepatotoxicants

Fingerprint Key  Description                    SMARTS Pattern              Present in Hepatotoxicants  Present in Nonhepatotoxicants
AP2D-102         O–O at topological distance 2  [#8]–[#8]                   365                         263
Estate-34        sOH                            [OD1H]−*                    393                         311
KR-298           –CHCH2<                        [!#1][CH2][CH]([!#1])[!#1]  296                         336
MACCS-153        QCH2A                          [!#6;!#1]–[CH2]–*           432                         430
Pubchem-257      More than 2 aromatic rings     aromatic rings ≥2           344                         218
FP4-287          Conjugated double bond         *=*[*]=,#,:[*]              378                         222

Figure 3. The 10 most important features from each random forest (RF) model trained with the Estate, MACCS, Pubchem, FP4, KR, and AP2D fingerprints. The mean decreases in the Gini coefficient are represented as means and standard deviations.

CONCLUSION

We used 3 machine learning algorithms and 12 molecular fingerprints of 1241 diverse compounds to build base prediction models of DILI. The accuracy ranged from 62.7% to 70.2%, and the AUC ranged from 0.674 to 0.748. To improve prediction performance, we used these base models to develop 35 ensemble models, all of which achieved higher accuracy than any base model. The Ensemble Top-5 model achieved an accuracy of 71.1 ± 2.6%, SE of 79.9 ± 3.6%, SP of 60.3 ± 4.8%, and AUC of 0.764 ± 0.026 in 5-fold cross-validation, and an accuracy of 84.3%, SE of 86.9%, SP of 75.4%, and AUC of 0.904 in external validation. This indicates that our Ensemble Top-5 model can significantly improve the performance of hepatotoxicity prediction. Compared with previous methods, our Ensemble Top-5 model achieves relatively high accuracy and SE, and may be effective in assessing DILI risk during the early stages of drug discovery.

SUPPLEMENTARY DATA

Supplementary data are available at Toxicological Sciences online.

ACKNOWLEDGMENTS

The authors are grateful for the support from the Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning and the Research Center for Computer Simulating and Information Processing of Bio-macromolecules of Liaoning Province.

FUNDING

National Natural Science Foundation of China (No. 31570160); Innovation Team Project (No. LT2015011) of the Education Department of Liaoning Province; Important Scientific and Technical Achievements Transformation Project (No.
Z17-5-078); and Applied Basic Research Project (Nos. F16205151 and 17231104) of the Science and Technology Bureau of Shenyang.

REFERENCES

Ai H., Zhang L., Chang A. K., Wei H., Che Y., Liu H. (2014). Virtual screening of potential inhibitors from TCM for the CPSF30 binding site on the NS1A protein of influenza A virus. J. Mol. Model. 20, 2142.
Ai H., Zheng F., Zhu C., Sun T., Li Z., Liu X., Li X., Zhu G., Liu H. (2010). Discovery of novel influenza inhibitors targeting the interaction of dsRNA with the NS1 protein by structure-based virtual screening. Int. J. Bioinform. Res. Appl. 6, 449–460.
Arlot S., Celisse A. (2010). A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79.
Bergstra J., Bengio Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305.
Björnsson E., Nordlinder H., Olsson R. (2006). Clinical characteristics and prognostic markers in disulfiram-induced liver injury. J. Hepatol. 44, 791.
Chen M., Hong H., Fang H., Kelly R., Zhou G., Borlak J., Tong W. (2013b). Quantitative structure-activity relationship models for predicting drug-induced liver injury based on FDA-approved drug labeling annotation and using a large collection of drugs. Toxicol. Sci. 136, 242.
Chen M., Vijay V., Shi Q., Liu Z., Fang H., Tong W. (2011). FDA-approved drug labeling for the study of drug-induced liver injury. Drug Discov. Today 16, 697–703.
Chen M., Zhang J., Wang Y., Liu Z., Kelly R., Zhou G., Fang H., Borlak J., Tong W. (2013a). The liver toxicity knowledge base: A systems approach to a complex end point. Clin. Pharmacol. Ther. 93, 409–412.
Chen T., Guestrin C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM.
Cheng F., Li W., Liu G., Tang Y. (2013). In silico ADMET prediction: Recent advances, current challenges and future trends. Curr. Top. Med. Chem. 13, 1273–1289.
Dearden J. C. (2016). The history and development of quantitative structure-activity relationships (QSARs). Int. J. Quant. Struct.-Prop. Relat. 1, 1–44.
Ekins S. (2014). Progress in computational toxicology. J. Pharmacol. Toxicol. Methods 69, 115–140.
Ekins S., Williams A. J., Xu J. J. (2010). A predictive ligand-based Bayesian model for human drug-induced liver injury. Drug Metab. Dispos. 38, 2302–2308.
Golbraikh A., Tropsha A. (2002). Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection. Mol. Divers. 5, 231–243.
Hastie T., Tibshirani R., Friedman J. H. (2001). The Elements of Statistical Learning. Springer, Berlin.
Jennen D., Polman J., Bessem M., Coonen M., Delft J. V., Kleinjans J. (2014). Drug-induced liver injury classification model based on in vitro human transcriptomics and in vivo rat clinical chemistry data. Syst. Biomed. 2, 63–70.
Karatzoglou A., Smola A., Hornik K., Zeileis A. (2004). kernlab—An S4 package for kernel methods in R. J. Stat. Softw. 11, 721–729.
Kuhn M., Leeuw J. D., Zeileis A. (2008). Building predictive models in R using the caret package. J. Stat. Softw. 28, 1–26.
Liaw A., Wiener M. (2001). Classification and regression by randomForest. R News 2, 18–22.
Liew C. Y., Lim Y. C., Yap C. W. (2011). Mixed learning algorithms and features ensemble in hepatotoxicity prediction. J. Comput. Aided Mol. Des. 25, 855.
Marzorati M., Wittebolle L., Boon N., Daffonchio D., Verstraete W. (2008). How to get more out of molecular fingerprints: Practical tools for microbial ecology. Environ. Microbiol. 10, 1571–1581.
Merlot C. (2010). Computational toxicology—A tool for early safety evaluation. Drug Discov. Today 15, 16.
Muster W., Breidenbach A., Fischer H., Kirchner S., Müller L., Pähler A. (2008). Computational toxicology in drug development. Drug Discov. Today 13, 303–310.
Nielsen D. (2016). Tree boosting with XGBoost—Why does XGBoost win "every" machine learning competition? Master's thesis, NTNU.
Przybylak K. R., Cronin M. T. (2012). In silico models for drug-induced liver injury—Current status. Expert Opin. Drug Metab. Toxicol. 8, 201–217.
Rokach L. (2010). Ensemble-based classifiers. Artif. Intell. Rev. 33, 1–39.
Segall M. D., Barber C. (2014). Addressing toxicity risk when designing and selecting compounds in early drug discovery. Drug Discov. Today 19, 688–693.
Sheridan R. P., Wang W. M., Liaw A., Ma J., Gifford E. M. (2016). Extreme gradient boosting as a method for quantitative structure-activity relationships. J. Chem. Inf. Model. 56, 2353.
Siroky D. S. (2009). Navigating random forests and related advances in algorithmic modeling. Stat. Surv. 3, 147–163.
Tomida T., Okamura H., Satsukawa M., Yokoi T., Konno Y. (2015). Multiparametric assay using HepaRG cells for predicting drug-induced liver injury. Toxicol. Lett. 236, 16–24.
Yap C. W. (2011). PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 32, 1466–1474.
Zhang C., Cheng F., Li W., Liu G., Lee P. W., Tang Y. (2016a). In silico prediction of drug induced liver toxicity using substructure pattern recognition method. Mol. Inform. 35, 136.
Zhang H., Ding L., Zou Y., Hu S. Q., Huang H. G., Kong W. B., Zhang J. (2016b). Predicting drug-induced liver injury in human with Naïve Bayes classifier approach. J. Comput. Aided Mol. Des. 30, 889.
Zhang L., Ai H., Chen W., Yin Z., Hu H., Zhu J., Zhao J., Zhao Q., Liu H. (2017). CarcinoPred-EL: Novel models for predicting the carcinogenicity of chemicals using molecular fingerprints and ensemble learning methods. Sci. Rep. 7, 2118.
Zhu X., Kruhlak N. L. (2014). Construction and analysis of a human hepatotoxicity database suitable for QSAR modeling using post-market safety data. Toxicology 321, 62–72.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Society of Toxicology. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).

Toxicological Sciences, Volume 165 (1), Sep 1, 2018
Publisher: Oxford University Press
ISSN: 1096-6080; eISSN: 1096-0929
DOI: 10.1093/toxsci/kfy121

Abstract

Drug-induced liver injury (DILI) is a major safety concern in the drug-development process, and various methods have been proposed to predict the hepatotoxicity of compounds during the early stages of drug trials. In this study, we developed an ensemble model using 3 machine learning algorithms and 12 molecular fingerprints from a dataset containing 1241 diverse compounds. The ensemble model achieved an average accuracy of 71.1 ± 2.6%, sensitivity (SE) of 79.9 ± 3.6%, specificity (SP) of 60.3 ± 4.8%, and area under the receiver-operating characteristic curve (AUC) of 0.764 ± 0.026 in 5-fold cross-validation and an accuracy of 84.3%, SE of 86.9%, SP of 75.4%, and AUC of 0.904 in an external validation dataset of 286 compounds collected from the Liver Toxicity Knowledge Base. Compared with previous methods, the ensemble model achieved relatively high accuracy and SE. We also identified several substructures related to DILI. In addition, we provide a web server offering access to our models (http://ccsipb.lnu.edu.cn/toxicity/HepatoPred-EL/).

Keywords: DILI; hepatotoxicity; molecular fingerprints; machine learning; ensemble

The liver is a vital organ of the body and plays an important role in metabolism. Drug-induced liver injury (DILI) is one of the leading causes of drug failure in trials and withdrawal from the market (Björnsson et al., 2006; Segall and Barber, 2014). Thus, determining the hepatotoxicity of compounds is essential. Over the past decades, various approaches have been developed to assess the risk of DILI, both in vivo and in vitro (Jennen et al., 2014; Tomida et al., 2015). However, these studies are expensive, time-consuming, and may not yield high correlations between experimental results and effects observed in humans. Therefore, computational approaches to predict the hepatotoxicity of compounds from chemical structure properties are gradually being recognized as an effective tool.
In silico approaches are a low-cost, fast method to collect information on potential toxicity, and great efforts have been made in hepatotoxicity prediction in recent years (Ekins, 2014; Przybylak and Cronin, 2012). In particular, quantitative structure-activity relationship (QSAR) models (Ai et al., 2014, 2010; Dearden, 2016; Golbraikh and Tropsha, 2002; Muster et al., 2008; Zhu and Kruhlak, 2014), which aim to explain the relationships between chemical structure features (eg, molecular descriptors) and biologic activities (such as hepatotoxicity, mutagenicity, and carcinogenicity) based on known activity datasets using various statistical algorithms, could predict the potential hepatotoxicity of a new compound before its synthesis. Thus our model could be used in the early stage of drug development, which aims to filter out the compounds with the potential risk of hepatotoxicity before in vivo and in vitro research (Cheng et al., 2013; Merlot, 2010). Various QSAR models for predicting hepatotoxicity have been reported, most of which have used machine learning methods (Chen et al., 2013b; Ekins et al., 2010; Liew et al., 2011; Zhang et al., 2016a,b). Ekins et al. (2010) developed a Bayesian modeling method with extended connectivity fingerprints and other interpretable descriptors, based on a training set of 295 compounds, and applied it to 237 compounds for external validation, achieving an accuracy of 57%–59% on the training set and 60% on the external test. Liew et al. (2011) presented an ensemble model of 617 base classifiers which used support vector machine (SVM) and k-nearest neighbor (k-NN) methods based on a diverse set of 1087 compounds, and achieved an overall accuracy of 63.8% in 5-fold cross-validation and 75.0% in an external validation of 120 compounds. Zhang et al. 
(2016a) built a model using 5 machine learning methods based on MACCS and FP4 fingerprints after evaluation by substructure pattern recognition, and reported that the best model used SVM together with the FP4 fingerprint at an information gain (IG) threshold of 0.0005, achieving an overall accuracy of 79.7% on the training set and 64.5% on the test set. Although these models are widely applicable, their accuracy in forecasting the hepatotoxicity of new compounds (estimated by cross-validation or external testing) remains unsatisfactory. Moreover, many models have not been evaluated by an appropriate cross-validation. Molecular fingerprints have been used in drug development and toxicity prediction (Marzorati et al., 2008), including in some of the models above. In this study, we used 12 types of molecular fingerprints and 3 machine learning methods to predict the hepatotoxicity of diverse organic compounds. Furthermore, we built several ensemble models that combine the various base models generated from a variety of fingerprint subsets and machine learning methods. To demonstrate their prediction capability and reliability, the models were evaluated by 5-fold cross-validation with 100 repeats and by external validation.

MATERIALS AND METHODS

Data Preparation

We derived the training set used to build the model from 2 papers. Liew et al. (2011) used 1274 compounds for the analysis and modeling process, most of which were included in the U.S. Food and Drug Administration (FDA) Orange Book of approved drug products with therapeutic equivalence evaluations. The authors took a highly conservative approach to labeling, flagging any drug with the potential to cause any adverse liver effect as "positive" and labeling the others as "negative." Zhu and Kruhlak (2014) used a calibration set of 282 compounds included in United States and European toxicity registries.
The authors used a classification approach similar to Liew's, based on whether alerting publications or warnings related to liver injury were found, and labeled the 282 compounds as "positive" or "negative" accordingly. To obtain a robust training set, we integrated the 2 datasets while deleting duplicates and compounds containing fewer than 3 carbon atoms, whose structures are oversimplified. In addition, we deleted from the training set any drugs that appeared in the external validation dataset. Ultimately, we built a training dataset of 1241 compounds containing 683 positives and 558 negatives. Details of the training dataset are provided in Supplementary Table 1. To further evaluate the predictive performance of our models, we used the FDA's Liver Toxicity Knowledge Base (LTKB) (Chen et al., 2011, 2013a) as the external validation dataset. This benchmark dataset contains 287 prescription drugs: 137 drugs with known hepatotoxicity, 85 drugs that may cause DILI, and 65 drugs with no DILI indications on their labels. We removed gemtuzumab from the dataset because no simplified molecular-input line-entry system (SMILES) representation of it is available in the LTKB benchmark dataset, and we used SMILES to calculate the molecular fingerprints. Thus, we built an external validation dataset of 286 compounds containing 221 DILI positives and 65 DILI negatives. Details of the external validation dataset are provided in Supplementary Table 2.

Calculation of Molecular Fingerprints

Twelve types of molecular fingerprints were generated to represent the chemical structure features of the compounds. The names and dimensions of these fingerprints are summarized in Table 1. All fingerprints were calculated with the PaDEL-Descriptor software (version 2.21) from the SMILES representations of the compounds (Yap, 2011). Each bit of these molecular fingerprints was used as a feature in the machine learning process. Table 1.
Summary of the 12 Types of Molecular Fingerprints Generated in the Study

Fingerprint Type     Abbreviation  Pattern Type               Size (bits)  Selected (bits)
CDK                  CDK           Hash fingerprints          1024         1019
CDK Extended         CDKExt        Hash fingerprints          1024         1002
CDK Graph            CDKGraph      Hash fingerprints          1024         597
Estate               Estate        Structural features        79           22
MACCS                MACCS         Structural features        166          114
Pubchem              Pubchem       Structural features        881          221
Substructure         FP4           Structural features        307          41
Substructure Count   FP4C          Structural features count  307          38
Klekota-Roth         KR            Structural features        4860         228
Klekota-Roth Count   KRC           Structural features count  4860         161
2D Atom Pairs        AP2D          Structural features        780          97
2D Atom Pairs Count  AP2DC         Structural features count  780          61

Feature Selection

Feature selection is an essential procedure in data processing: removing redundant features can improve prediction performance, yield faster and more cost-effective predictors, and provide a better understanding of the underlying process that generated the data. In this study, we used the nearZeroVar function from the R package caret (version 6.0-71) (Kuhn et al., 2008) to filter out features with little variability.
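For readers unfamiliar with caret, the spirit of nearZeroVar can be sketched as follows (illustrative Python, not a reimplementation of caret; the cutoffs mirror caret's documented defaults of freqCut = 95/5 and uniqueCut = 10%, and the toy fingerprint matrix is hypothetical):

```python
from collections import Counter

def near_zero_variance(columns, freq_ratio_cutoff=19.0, unique_cutoff=0.1):
    """Return indices of feature columns to drop: constant columns, and
    columns whose most common value dominates while few distinct values
    exist. This mimics the intent of caret's nearZeroVar."""
    drop = []
    for j, col in enumerate(columns):
        counts = Counter(col).most_common()
        if len(counts) == 1:                 # zero variance: a single value
            drop.append(j)
            continue
        freq_ratio = counts[0][1] / counts[1][1]
        pct_unique = len(counts) / len(col)
        if freq_ratio > freq_ratio_cutoff and pct_unique < unique_cutoff:
            drop.append(j)
    return drop

# Toy fingerprint matrix stored column-wise: bit 0 is constant
# (e.g. "contains a heavy metal" is always 0 here), bit 1 is informative.
bits = [
    [0] * 20,
    [0, 1] * 10,
]
print(near_zero_variance(bits))  # → [0]
```

Only the constant column is dropped; the informative bit survives to the next filtering step.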
Quite a few features took a single value for all compounds in the training set, such as the features indicating whether a compound contains carbon atoms or heavy metals; because every compound in this study contains more than 3 carbon atoms and none contains a heavy metal, such features are uninformative for classifying hepatotoxic compounds. We then used the findCorrelation function to filter out highly correlated features (Pearson's correlation coefficient > 0.9). Many fingerprint bits are inherently correlated, such as the 2 bits indicating whether a compound contains C atoms or C–H bonds: a compound that contains C atoms very likely also contains C–H bonds. The remaining features (bits) for each molecular fingerprint are summarized in Table 1.

Model Building

The SVM, random forest (RF), and extreme gradient boosting (XGBoost) algorithms were all executed in R (version 3.3.1) using the kernlab (version 0.9-25) (Karatzoglou et al., 2004), randomForest (version 4.6-12) (Liaw and Wiener, 2001), and xgboost (version 0.4-4) (Chen and Guestrin, 2016) packages, respectively. A description of the basic theory of each algorithm and of how they were used is provided in our previous study (Zhang et al., 2017). Briefly, these algorithms are machine learning methods that can accommodate many features efficiently.

Support vector machine

An SVM is an efficient machine learning method based on statistical learning theory that analyzes data for classification and regression. The algorithm maps the features of the input data into a much higher-dimensional space through a kernel function and constructs a hyperplane, or set of hyperplanes, in that high- or infinite-dimensional space to separate positives from negatives. In this study, we used the radial basis function kernel to build the SVM models.
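The radial basis function kernel itself is simple to state. A small illustration (Python; the study used the kernlab R package, and the gamma value and bit vectors here are arbitrary):

```python
import math

def rbf_kernel(x, y, gamma=0.01):
    """Radial basis function kernel K(x, y) = exp(-gamma * ||x - y||^2).

    An SVM with this kernel implicitly maps fingerprint bit vectors into an
    infinite-dimensional space in which a separating hyperplane is sought.
    The gamma here is arbitrary; the study tuned it by random search.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Toy fingerprint vectors: similar bit patterns give kernel values near 1,
# dissimilar ones decay toward 0.
a = [1, 0, 1, 1, 0]
b = [1, 0, 1, 0, 0]
c = [0, 1, 0, 0, 1]
print(rbf_kernel(a, a))                      # 1.0 for identical vectors
print(rbf_kernel(a, b) > rbf_kernel(a, c))   # closer vectors score higher
```

The kernel value thus acts as a structural similarity score between two fingerprints.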
The regularization parameter C and the kernel width parameter gamma were optimized by random search (Bergstra and Bengio, 2012), as implemented in the caret package.

Random forest

Random forest is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (or the mean prediction) of the individual trees. Compared with a single decision tree, RF is more robust and can handle large amounts of data without overfitting (Hastie et al., 2001). The number of trees in the forest (ntree) and the number of features randomly sampled at each split (mtry) are the most important parameters; we set ntree to 500 and mtry to the square root of the number of features in the dataset. In addition, we used the importance function in the randomForest package to assess the feature importance for each type of molecular fingerprint.

Extreme gradient boosting

XGBoost is an efficient and scalable implementation of the gradient boosting algorithm. It applies clever penalization to the individual trees, which are consequently allowed to have varying numbers of terminal nodes (Nielsen, 2016; Sheridan et al., 2016). XGBoost often achieves higher accuracy while requiring less computing time. In this model, 4 important parameters, eta (the step-size shrinkage), max.depth (maximum tree depth), min.child.weight (minimum sum of instance weights), and nrounds (the maximum number of iterations), were optimized with the caret package.

Ensemble learning methods fuse multiple base models via voting or averaging and obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone (Rokach, 2010). In particular, ensemble methods tend to produce much better results when there is significant diversity among the base models.
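The probability-averaging fusion used for the ensembles here can be sketched in a few lines (illustrative Python; the scores and probabilities below are hypothetical numbers, not values from the study):

```python
def ensemble_probability(base_probs):
    """Fuse base-classifier outputs by averaging their predicted
    probabilities of hepatotoxicity."""
    return sum(base_probs) / len(base_probs)

def top_n_ensemble(classifier_scores, probs_per_classifier, n):
    """Average the probabilities of the n best-performing base classifiers.

    classifier_scores: a cross-validation score (e.g. AUC) per base classifier.
    probs_per_classifier: each classifier's probability for one compound.
    """
    ranked = sorted(range(len(classifier_scores)),
                    key=lambda i: classifier_scores[i], reverse=True)
    chosen = ranked[:n]
    return ensemble_probability([probs_per_classifier[i] for i in chosen])

scores = [0.748, 0.744, 0.717, 0.674, 0.742]  # toy per-classifier AUCs
probs  = [0.80, 0.70, 0.40, 0.90, 0.60]       # one compound's predictions
print(top_n_ensemble(scores, probs, n=3))     # averages 0.80, 0.70, 0.60
```

Varying n over all possible sizes of the ranked classifier pool is what yields the family of ensemble models described below.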
Recently, these methods have been widely used in many fields, including toxicity prediction. Thirty-six base classifiers were built from the 3 machine learning algorithms and 12 molecular fingerprints, and 35 ensemble models were constructed from them: for each n (n = 36, 35, …, 3, 2), the n base classifiers with the best predictive performance were fused into an ensemble model by averaging the probabilities output by the base classifiers. Figure 1 displays a flowchart of the ensemble model-building process.

Figure 1. Flowchart of the ensemble model-building process.

Performance Evaluation

The predictive performance of the models was evaluated by 5-fold cross-validation with 100 repeats and by external validation. In 5-fold cross-validation, the original sample is randomly divided into 5 equal subsamples; 4 are used as training data, and the remaining 1 is used to validate the model. The cross-validation process is repeated 5 times so that each of the 5 subsamples is used exactly once as validation data (Arlot and Celisse, 2010). The whole process was repeated 100 times to reduce the randomness of the results and give a robust performance evaluation. The following 4 indicators were used to assess the predictive performance of the models: accuracy (Q), the overall prediction accuracy for hepatotoxicants and nonhepatotoxicants; sensitivity (SE), the prediction accuracy for hepatotoxicants; specificity (SP), the prediction accuracy for nonhepatotoxicants; and the area under the receiver-operating characteristic curve (AUC).
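These count-based indicators follow directly from the confusion matrix; a minimal sketch (illustrative Python with hypothetical labels, shown before the formal definitions):

```python
def performance(y_true, y_pred):
    """Q, SE, and SP (in %) from binary labels (1 = hepatotoxicant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    q  = 100 * (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
    se = 100 * tp / (tp + fn)                    # sensitivity
    sp = 100 * tn / (tn + fp)                    # specificity
    return q, se, sp

# Toy evaluation: 6 hepatotoxicants and 4 nonhepatotoxicants.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]
print(performance(y_true, y_pred))
```

On this toy set the classifier misses 1 of 6 hepatotoxicants and misclassifies 2 of 4 nonhepatotoxicants, illustrating the high-SE, lower-SP behavior discussed earlier.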
These indicators were calculated as follows:

Q = (TP + TN) / (TP + TN + FN + FP) × 100%,  (1)

SE = TP / (TP + FN) × 100%,  (2)

SP = TN / (TN + FP) × 100%,  (3)

where true positives (TP) is the number of hepatotoxicants correctly predicted, true negatives (TN) is the number of nonhepatotoxicants correctly predicted, false positives (FP) is the number of nonhepatotoxicants wrongly predicted as hepatotoxicants, and false negatives (FN) is the number of hepatotoxicants wrongly predicted as nonhepatotoxicants. The receiver-operating characteristic (ROC) curve is a plot of the TP rate (SE) against the FP rate (1 − SP) over the different possible cutoff points of a diagnostic test. The AUC was calculated as an important indicator of the prediction ability of the model.

RESULTS

Twelve types of molecular fingerprints and 3 machine learning methods were used to predict the hepatotoxicity of diverse organic compounds. Thirty-six base classifiers were generated, and their performance was evaluated by 5-fold cross-validation. The results are presented in Table 2 and indicate that the SVM algorithm achieves slightly higher SE while yielding relatively lower SP than the other algorithms. In the training set, the accuracy (Q) ranged from 62.7% to 70.2%, the AUC ranged from 0.674 to 0.748, the SE ranged from 66.2% to 81.6%, and the SP ranged from 40.7% to 62.6%. The most accurate classifier was generated by the SVM method using CDKExt fingerprints, with a Q of 70.2%, and the highest AUC was produced by the RF method using FP4C fingerprints, with an AUC of 0.748. We conclude that different algorithms and fingerprints have different advantages and disadvantages in predicting hepatotoxicity. Table 2.
Performance of the Base Classifiers in 5-fold Cross-validation

Algorithm  Fingerprint  Q (%)      SE (%)     SP (%)     AUC
SVM        CDK          69.6±2.6   78.5±3.8   58.6±4.5   0.744±0.028
SVM        CDKExt       70.2±2.6   79.3±3.8   59.1±4.7   0.747±0.027
SVM        CDKGraph     66.1±2.7   75.5±4.0   54.6±4.9   0.717±0.029
SVM        Estate       63.9±2.6   75.9±4.1   49.1±4.9   0.674±0.030
SVM        MACCS        68.7±2.6   77.3±4.0   58.3±4.8   0.733±0.027
SVM        Pubchem      69.4±2.6   77.5±3.9   59.6±4.8   0.742±0.028
SVM        FP4          66.8±2.6   75.7±4.2   55.8±4.5   0.713±0.029
SVM        FP4C         66.2±2.6   75.9±4.3   54.3±5.0   0.702±0.029
SVM        KR           66.9±2.6   76.3±4.3   55.3±4.7   0.723±0.028
SVM        KRC          65.8±2.7   72.7±5.1   57.4±5.3   0.710±0.028
SVM        AP2D         65.8±2.7   78.2±3.9   50.6±4.5   0.699±0.029
SVM        AP2DC        63.2±2.5   81.6±3.7   40.7±4.3   0.669±0.030
RF         CDK          68.5±2.5   77.6±3.7   57.3±4.5   0.735±0.027
RF         CDKExt       68.6±2.6   78.4±3.5   56.6±4.9   0.734±0.027
RF         CDKGraph     66.7±2.7   74.4±4.1   57.2±4.7   0.717±0.029
RF         Estate       64.6±2.7   69.4±4.1   58.8±4.7   0.692±0.029
RF         MACCS        68.8±2.6   74.5±3.8   61.7±4.5   0.738±0.028
RF         Pubchem      68.4±2.8   75.2±3.7   60.1±4.5   0.738±0.027
RF         FP4          66.9±2.6   73.6±4.0   58.8±4.7   0.716±0.028
RF         FP4C         69.4±2.5   75.8±3.4   61.5±4.6   0.748±0.026
RF         KR           66.7±2.7   73.6±4.1   58.3±4.8   0.728±0.028
RF         KRC          67.8±2.8   74.3±3.9   59.8±4.7   0.739±0.026
RF         AP2D         65.8±2.7   73.0±3.9   57.1±4.5   0.710±0.028
RF         AP2DC        65.7±2.9   71.7±4.3   58.4±4.8   0.701±0.031
XGBoost    CDK          68.3±2.6   75.1±3.9   60.0±4.6   0.737±0.028
XGBoost    CDKExt       68.5±2.7   76.1±3.7   59.1±5.0   0.738±0.027
XGBoost    CDKGraph     66.1±2.6   71.6±4.2   59.4±4.6   0.720±0.027
XGBoost    Estate       62.7±2.9   66.2±4.4   58.4±4.9   0.675±0.030
XGBoost    MACCS        66.9±2.6   70.9±3.8   62.0±4.7   0.721±0.028
XGBoost    Pubchem      68.7±2.6   73.7±4.0   62.6±4.5   0.737±0.029
XGBoost    FP4          66.3±2.8   73.5±4.0   57.3±4.8   0.712±0.031
XGBoost    FP4C         68.5±2.4   73.8±3.6   62.0±4.3   0.733±0.027
XGBoost    KR           66.3±2.8   73.6±4.0   57.2±4.8   0.714±0.028
XGBoost    KRC          67.2±2.6   73.9±3.9   58.9±4.5   0.721±0.027
XGBoost    AP2D         64.9±2.8   70.9±4.1   57.6±4.6   0.694±0.030
XGBoost    AP2DC        64.3±2.7   69.4±4.3   58.0±4.6   0.683±0.029

The performance values are represented as means and standard deviations. Abbreviations: SVM, support vector machine; RF, random forest; XGBoost, extreme gradient boosting; Q, accuracy; SE, sensitivity; SP, specificity; AUC, area under the curve.

To integrate the advantages of the various algorithms and fingerprints, we built several ensemble models based on the 36 base classifiers. We first ordered the 36 classifiers by AUC, then selected the n classifiers with the highest AUC values and averaged their prediction probabilities to predict hepatotoxicity.
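This selection-and-averaging step can be sketched as follows; the classifier names, probabilities, and function name below are illustrative, not values or code from the study:

```python
def ensemble_probabilities(probs, aucs, n=5):
    """Fuse the n base classifiers with the highest AUC by averaging
    their predicted hepatotoxicity probabilities per compound."""
    # rank base classifiers by cross-validated AUC and keep the top n
    top = sorted(aucs, key=aucs.get, reverse=True)[:n]
    n_compounds = len(next(iter(probs.values())))
    return [sum(probs[c][i] for c in top) / len(top) for i in range(n_compounds)]

# hypothetical probabilities for 2 compounds from 3 base classifiers
probs = {"RF-FP4C": [0.8, 0.3], "SVM-CDKExt": [0.6, 0.5], "SVM-CDK": [0.7, 0.1]}
aucs = {"RF-FP4C": 0.748, "SVM-CDKExt": 0.747, "SVM-CDK": 0.744}
fused = ensemble_probabilities(probs, aucs, n=2)  # averages RF-FP4C and SVM-CDKExt
```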
This process generated 35 new ensemble models, whose performance indicators in cross-validation are presented in Supplementary Table 3. As expected, all ensemble models achieved higher accuracy and AUC than any of the base classifiers. In addition, almost all sensitivities and specificities of the ensemble models were slightly higher than those of the single models, and almost all variances were reduced. By combining diverse and mutually independent models, the ensemble models achieved better predictive performance and more stable predictions without unbalanced classification. The numbers of most accurate base classifiers and their corresponding AUC values are shown in Figure 2. The best ensemble model consisted of 5 base classifiers: RF using FP4C, SVM using CDKExt, SVM using CDK, SVM using Pubchem, and RF using KRC. We named this the Ensemble Top-5 model; it achieved 71.1% ± 2.6% accuracy, 79.9% ± 3.6% SE, 60.3% ± 4.8% SP, and 0.764 ± 0.026 AUC. Compared with the most accurate base classifier (SVM using CDKExt) and the classifier with the highest AUC (RF using FP4C), the ensemble model improved accuracy by 0.9% and AUC by 0.016, indicating that the ensemble method can improve the performance of hepatotoxicity predictions.

Figure 2. The numbers of most accurate base classifiers and their corresponding area under the curve (AUC) values.

To assess the performance of our ensemble model in external validation, we used a dataset of 286 compounds from the LTKB database.
These compounds were not included in the training set and were not involved in the model-building process, so the results objectively reflect the ensemble model's ability to predict the hepatotoxicity of chemical compounds. The ensemble model yielded an accuracy of 84.3%, SE of 86.9%, SP of 75.4%, and AUC of 0.904. For comparison, we also evaluated the 36 base classifiers on the external validation set; the results are presented in Supplementary Table 4. The single models also achieved high accuracy on the LTKB compounds, but our Ensemble Top-5 model achieved higher accuracy and AUC than almost all of them. Unlike cross-validation estimates, performance metrics obtained from a smaller external test set are subject to chance variation and are therefore less precise; even so, very few single models outperformed the ensemble model in external validation. This shows that our Ensemble Top-5 model is not only accurate but also stable in predicting the hepatotoxicity of chemical compounds. We built a user-friendly web server, HepatoPred-EL, to provide convenient access to our models (http://ccsipb.lnu.edu.cn/toxicity/HepatoPred-EL/).

DISCUSSION

Comparison with Previous Methods

Various computational methods for hepatotoxicity prediction have been reported (Chen et al., 2013b; Ekins et al., 2010; Liew et al., 2011; Zhang et al., 2016a,b). To ensure the precision of our analysis, we selected only cross-validated QSAR methods for comparison; their main indicators are summarized in Table 3. Because datasets and feature selections vary, a direct comparison of methods can be flawed; nevertheless, some useful information can still be obtained. We drew the following conclusions from the comparison: (1) The accuracy of our model is higher than that of previous methods.
However, none of these methods achieves high accuracy, as most accuracies are below 70%; this may be due to the complexity of hepatotoxicity. (2) Our Ensemble Top-5 model achieved relatively high SE. Sensitivity reflects the ability to correctly identify hepatotoxicants and is considered the more important indicator of the quality of a hepatotoxicity classifier, because a superior ability to identify hepatotoxicants can warn drug developers of the potential adverse effects of candidate compounds so that they can conduct further verification during drug development. On the other hand, a model with high SE and poor accuracy could also carry a high risk of discarding good drugs, so toxicity researchers and drug developers should balance the trade-off between losing good drugs and selecting toxic ones. To address this problem, our model not only predicts whether a compound is hepatotoxic but also provides the probability that the compound is a hepatotoxicant. Researchers can set appropriate thresholds based on the diseases they focus on and the purposes of their research. For example, the default threshold of our model is 0.5: compounds with predicted probabilities of hepatotoxicity greater than 0.5 are classified as hepatotoxicants, and the others as nonhepatotoxicants. Researchers who focus more on toxicity can reduce this threshold, and those who focus more on activity can increase it. These probabilities are included in the prediction results of the web server we built for this model. (3) We used the LTKB database as our external dataset and achieved an accuracy of 84.3%; Zhang et al. (2016b) achieved an accuracy of 72.6% on a dataset of 420 compounds that contains the 286 LTKB compounds. By comparison, our model is likely to perform better.
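The adjustable-threshold behavior described in point (2) can be sketched as follows; the probabilities are hypothetical, and the web server's actual interface may differ:

```python
def classify(prob_toxic, threshold=0.5):
    """Label compounds from predicted hepatotoxicity probabilities.
    Lowering the threshold favors sensitivity; raising it favors specificity."""
    return ["hepatotoxicant" if p > threshold else "nonhepatotoxicant"
            for p in prob_toxic]

probs = [0.35, 0.55, 0.80]        # hypothetical predicted probabilities
default = classify(probs)          # default threshold of 0.5
cautious = classify(probs, 0.3)    # lower threshold for safety-focused screening
```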
Nevertheless, DILI is a very complex toxic endpoint, and this complexity can cause inaccuracies in computational models; the meaning of a positive or negative DILI signal for a drug depends heavily on the context of the disease being treated and on the intended dose and duration. Under current modeling technology, we could not fully account for the complexity of hepatotoxicity when developing the models. Although our research can provide a useful reference for drug development, more effort is needed in this field.

Table 3. Performance Indicators of Several Hepatotoxicity Prediction Models Reported in the Literature

Model Name       No. of Compounds  Test Method      Q (%)  SE (%)  SP (%)  AUC    Reference
Bayesian         295               10-fold CV×100   58.5   52.8    65.5    0.62   Ekins et al. (2010)
Mixed            1087              5-fold CV        63.8   64.1    63.3    0.676  Liew et al. (2011)
Decision Forest  197               10-fold CV×2000  69.7   57.8    77.9    –      Chen et al. (2013b)
SVM              1317              Test set         66.5   92.9    24.0    0.651  Zhang et al. (2016a)
Naïve Bayesian   420               Test set         72.6   72.5    72.7    –      Zhang et al. (2016b)
Ensemble Top-5   1241              5-fold CV×100    71.1   79.9    60.3    0.764  Present study
                                   Test set         84.3   86.9    75.4    0.904  Present study

Abbreviations: SVM, support vector machine; Q, accuracy; SE, sensitivity; SP, specificity; AUC, area under the curve.

Substructures Related to Hepatotoxicity

To better understand the contributions of substructures to hepatotoxicity prediction, the mean decrease in the Gini coefficient, computed with the RF algorithm, was used to assess the importance of these substructures (Siroky, 2009). In this study, every bit in a molecular fingerprint corresponds to a chemical group, and we queried the fingerprint keys and corresponding SMARTS patterns in the official documentation of the PaDEL-Descriptor software. Important bits can indicate the hepatotoxicity risk of a chemical and offer reference value for drug screening and medicine development, so we analyzed the important bits in the molecular fingerprints.
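The importance measure can be illustrated with a minimal sketch: the decrease in Gini impurity obtained by splitting compounds on a single fingerprint bit. A random forest averages such decreases over all nodes and trees; the toy data and function names here are ours, not the study's code:

```python
def gini(labels):
    """Gini impurity of a binary label list (1 = hepatotoxicant)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def gini_decrease(bits, labels):
    """Impurity decrease from splitting compounds on one fingerprint bit."""
    left = [y for b, y in zip(bits, labels) if b == 0]   # bit absent
    right = [y for b, y in zip(bits, labels) if b == 1]  # bit present
    n = len(labels)
    child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - child

d = gini_decrease([1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 0])
# a perfectly separating bit removes all impurity: d = 0.5
```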
Because CDK, CDKExt, and CDKGraph are hashed fingerprints, and FP4C, KRC, and AP2DC are structural-feature count fingerprints, we calculated the mean decreases in the Gini coefficient using the Estate, MACCS, Pubchem, FP4, KR, and AP2D fingerprints. For each type of fingerprint, we selected the 10 most important features based on the mean decreases in the Gini coefficient (Figure 3). The higher the mean decrease in the Gini coefficient, the more important the feature is to hepatotoxicity prediction. The fingerprint keys and the substructures' occurrence in hepatotoxicants and nonhepatotoxicants are presented in Table 4. Almost every feature appeared in more than half of the compounds in the 1241-compound dataset, indicating that these features play a substantial role in the prediction process. With the exception of KR-298, all of these features appeared more frequently in hepatotoxicants than in nonhepatotoxicants, implying that hepatotoxicants and nonhepatotoxicants differ in structure and that hepatotoxicants may share several particular substructures. This may also explain why the SE of our model was much higher than its SP. We therefore recommend that these substructures be taken into account when designing therapeutic compounds. Table 4.
Top-Ranking Substructures and Their Occurrence in Hepatotoxicants and Nonhepatotoxicants

Fingerprint Key  Description                    SMARTS Pattern              Present in Hepatotoxicants  Present in Nonhepatotoxicants
AP2D-102         O–O at topological distance 2  [#8]–[#8]                   365                         263
Estate-34        sOH                            [OD1H]−*                    393                         311
KR-298           –CHCH2<                        [!#1][CH2][CH]([!#1])[!#1]  296                         336
MACCS-153        QCH2A                          [!#6;!#1]–[CH2]–*           432                         430
Pubchem-257      more than 2 aromatic rings     aromatic rings ≥2           344                         218
FP4-287          Conjugated double bond         *=*[*]=,#,:[*]              378                         222

Figure 3. The 10 most important features from each random forest (RF) model trained with Estate, MACCS, Pubchem, FP4, KR, and AP2D fingerprints.
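The occurrence counts in Table 4 can be reproduced, in principle, by tallying fingerprint bits separately for the two classes. This sketch uses tiny made-up fingerprints; the function name is ours:

```python
def substructure_counts(fp_bits, labels):
    """For each fingerprint bit, count how many hepatotoxicants (label 1)
    and nonhepatotoxicants (label 0) contain that substructure."""
    n_bits = len(fp_bits[0])
    tox = [0] * n_bits
    nontox = [0] * n_bits
    for bits, y in zip(fp_bits, labels):
        for i, b in enumerate(bits):
            if b:
                (tox if y == 1 else nontox)[i] += 1
    return tox, nontox

# 4 hypothetical compounds, 2 fingerprint bits each
fps = [[1, 0], [1, 1], [0, 1], [1, 0]]
labels = [1, 1, 0, 0]
tox, nontox = substructure_counts(fps, labels)
# bit 0 occurs in 2 hepatotoxicants and 1 nonhepatotoxicant
```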
The mean decreases in the Gini coefficient are represented as means and standard deviations.

CONCLUSION

We used 3 machine learning algorithms and 12 molecular fingerprints of 1241 diverse compounds to build base models for predicting DILI. Their accuracy ranged from 62.7% to 70.2%, and their AUC ranged from 0.674 to 0.748. To improve predictive performance, we combined these base models into 35 ensemble models, all of which achieved higher accuracy than any base model. The Ensemble Top-5 model achieved an accuracy of 71.1% ± 2.6%, SE of 79.9% ± 3.6%, SP of 60.3% ± 4.8%, and AUC of 0.764 ± 0.026 in 5-fold cross-validation, and an accuracy of 84.3%, SE of 86.9%, SP of 75.4%, and AUC of 0.904 in external validation. This indicates that our Ensemble Top-5 model can significantly improve the performance of hepatotoxicity prediction. Compared with previous methods, the Ensemble Top-5 model achieves relatively high accuracy and SE and may be effective for assessing DILI risk during the early stages of drug discovery.

SUPPLEMENTARY DATA

Supplementary data are available at Toxicological Sciences online.

ACKNOWLEDGMENTS

The authors are grateful for the support from the Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning and the Research Center for Computer Simulating and Information Processing of Bio-macromolecules of Liaoning Province.

FUNDING

National Natural Science Foundation of China (No. 31570160); Innovation Team Project (No. LT2015011) of the Education Department of Liaoning Province; Important Scientific and Technical Achievements Transformation Project (No.
Z17-5-078); and Applied Basic Research Project (Nos F16205151 and 17231104) of the Science and Technology Bureau of Shenyang.

REFERENCES

Ai H., Zhang L., Chang A. K., Wei H., Che Y., Liu H. (2014). Virtual screening of potential inhibitors from TCM for the CPSF30 binding site on the NS1A protein of influenza A virus. J. Mol. Model. 20, 2142.
Ai H., Zheng F., Zhu C., Sun T., Li Z., Liu X., Li X., Zhu G., Liu H. (2010). Discovery of novel influenza inhibitors targeting the interaction of dsRNA with the NS1 protein by structure-based virtual screening. Int. J. Bioinform. Res. Appl. 6, 449–460.
Arlot S., Celisse A. (2010). A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79.
Bergstra J., Bengio Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305.
Björnsson E., Nordlinder H., Olsson R. (2006). Clinical characteristics and prognostic markers in disulfiram-induced liver injury. J. Hepatol. 44, 791.
Chen M., Hong H., Fang H., Kelly R., Zhou G., Borlak J., Tong W. (2013b). Quantitative structure-activity relationship models for predicting drug-induced liver injury based on FDA-approved drug labeling annotation and using a large collection of drugs. Toxicol. Sci. 136, 242.
Chen M., Vijay V., Shi Q., Liu Z., Fang H., Tong W. (2011). FDA-approved drug labeling for the study of drug-induced liver injury. Drug Discov. Today 16, 697–703.
Chen M., Zhang J., Wang Y., Liu Z., Kelly R., Zhou G., Fang H., Borlak J., Tong W. (2013a). The liver toxicity knowledge base: A systems approach to a complex end point. Clin. Pharmacol. Ther. 93, 409–412.
Chen T., Guestrin C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM.
Cheng F., Li W., Liu G., Tang Y. (2013). In silico ADMET prediction: Recent advances, current challenges and future trends. Curr. Top. Med. Chem. 13, 1273–1289.
Dearden J. C. (2016). The history and development of quantitative structure-activity relationships (QSARs). Int. J. Qual. Struct.-Prop. Relat. 1, 1–44.
Ekins S. (2014). Progress in computational toxicology. J. Pharmacol. Toxicol. Methods 69, 115–140.
Ekins S., Williams A. J., Xu J. J. (2010). A predictive ligand-based Bayesian model for human drug-induced liver injury. Drug Metab. Dispos. 38, 2302–2308.
Golbraikh A., Tropsha A. (2002). Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection. Mol. Diver. 5, 231–243.
Hastie T., Tibshirani R., Friedman J. H. (2001). Elements of Statistical Learning. Springer, Berlin.
Jennen D., Polman J., Bessem M., Coonen M., Delft J. V., Kleinjans J. (2014). Drug-induced liver injury classification model based on in vitro human transcriptomics and in vivo rat clinical chemistry data. Syst. Biomed. 2, 63–70.
Karatzoglou A., Smola A., Hornik K., Zeileis A. (2004). kernlab—An S4 package for kernel methods in R. J. Stat. Softw. 11, 721–729.
Kuhn M., Leeuw J. D., Zeileis A. (2008). Building predictive models in R using the caret package. J. Stat. Softw. 28, 1–26.
Liaw A., Wiener M. (2001). Classification and regression by randomForest. R News 2, 18–22.
Liew C. Y., Lim Y. C., Yap C. W. (2011). Mixed learning algorithms and features ensemble in hepatotoxicity prediction. J. Comput. Aided Mol. Des. 25, 855.
Marzorati M., Wittebolle L., Boon N., Daffonchio D., Verstraete W. (2008). How to get more out of molecular fingerprints: Practical tools for microbial ecology. Environ. Microbiol. 10, 1571–1581.
Merlot C. (2010). Computational toxicology—A tool for early safety evaluation. Drug Discov. Today 15, 16.
Muster W., Breidenbach A., Fischer H., Kirchner S., Müller L., Pähler A. (2008). Computational toxicology in drug development. Drug Discov. Today 13, 303–310.
Nielsen D. (2016). Tree boosting with XGBoost—Why does XGBoost win "every" machine learning competition? Master's thesis, NTNU.
Przybylak K. R., Cronin M. T. (2012). In silico models for drug-induced liver injury—Current status. Expert Opin. Drug Metab. Toxicol. 8, 201–217.
Rokach L. (2010). Ensemble-based classifiers. Artif. Intell. Rev. 33, 1–39.
Segall M. D., Barber C. (2014). Addressing toxicity risk when designing and selecting compounds in early drug discovery. Drug Discov. Today 19, 688–693.
Sheridan R. P., Wang W. M., Liaw A., Ma J., Gifford E. M. (2016). Extreme gradient boosting as a method for quantitative structure-activity relationships. J. Chem. Inf. Model. 56, 2353.
Siroky D. S. (2009). Navigating random forests and related advances in algorithmic modeling. Stat. Surv. 3, 147–163.
Tomida T., Okamura H., Satsukawa M., Yokoi T., Konno Y. (2015). Multiparametric assay using HepaRG cells for predicting drug-induced liver injury. Toxicol. Lett. 236, 16–24.
Yap C. W. (2011). PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 32, 1466–1474.
Zhang C., Cheng F., Li W., Liu G., Lee P. W., Tang Y. (2016a). In silico prediction of drug induced liver toxicity using substructure pattern recognition method. Mol. Inform. 35, 136.
Zhang H., Ding L., Zou Y., Hu S. Q., Huang H. G., Kong W. B., Zhang J. (2016b). Predicting drug-induced liver injury in human with Naïve Bayes classifier approach. J. Comput. Aided Mol. Des. 30, 889–810.
Zhang L., Ai H., Chen W., Yin Z., Hu H., Zhu J., Zhao J., Zhao Q., Liu H. (2017). CarcinoPred-EL: Novel models for predicting the carcinogenicity of chemicals using molecular fingerprints and ensemble learning methods. Sci. Rep. 7, 2118.
Zhu X., Kruhlak N. L. (2014). Construction and analysis of a human hepatotoxicity database suitable for QSAR modeling using post-market safety data. Toxicology 321, 62–72.

© The Author(s) 2018. Published by Oxford University Press on behalf of the Society of Toxicology. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

Journal: Toxicological Sciences (Oxford University Press). Published: Sep 1, 2018.
