TY - JOUR AU - Tong,, Weida AB - Abstract Three QSAR methods, artificial neural net (ANN), k-nearest neighbors (kNN), and Decision Forest (DF), were applied to 3363 diverse compounds tested for their Ames genotoxicity. The ratio of mutagens to non-mutagens was 60/40 for this dataset. This group of compounds includes >300 therapeutic drugs. All models were developed using the same initial set of 148 topological indices: molecular connectivity χ indices and electrotopological state indices (atom-type, bond-type and group-type E-state), as well as binary indicators. While previous studies have found logP to be a determining factor in genotoxicity, it was not found to be important by any modeling method employed in this study. The three models yielded an average training/test concordance value of 88%, with a low percentage of false positives and false negatives. External validation testing on 400 compounds not used for QSAR model development gave an average concordance of 82%. This value increased to 92% upon removal of less reliable outcomes, as determined by a reliability criterion used within each model. The ANN model showed the best performance in predicting drug compounds, yielding 97% concordance (34/35 drugs) after the removal of less reliable predictions. The appreciable commonality found among the top 10 ranked descriptors from each model is of particular interest because of the diversity in the learning algorithms and descriptor selection techniques employed in this study. Forty percent of the most important descriptors in any one model are found in one or two other models. Fourteen of the most important descriptors relate directly to known toxicophores involved in potent genotoxic responses in Salmonella typhimurium. A comparison of the validation results with those of MULTICASE and DEREK indicated that the new models presented in this work perform substantially better than the former models in predicting genotoxicity of therapeutic drugs. Substantially higher specificity was achieved with these new models as compared with MULTICASE or DEREK with comparable sensitivities among all models. Introduction In silico predictive models for genotoxicity fall into two principal categories: rule-based (expert system) and quantitative structure–activity relationship models (QSAR). Rule-based systems are built upon the earlier work of Miller and Miller (1977) and others (Ashby and Tennant, 1988, 1991; Ashby et al., 1989; Tennant and Ashby, 1991) who first systematized relationships between chemical substructures and observed toxic outcomes. Expert systems are composed of structural rules derived from specific toxicological mechanisms or plausible modes of action of chemical agents in combination with pattern recognition routines to identify substructures associated with specific toxic effects (Sanderson and Earnshaw, 1991). A great deal of effort is required in order for toxicologists to develop structural rules. Such rules, however, are favored by toxicologists because of their perceived transparency in the form of structure alerts that have a straightforward interpretation. The practice of relying on structural rules may be problematic. The designation of a substructure or fragment as a toxicophore is only a binary indication, as it is possible for the structure alert fragment to be present in a molecule without being a toxicophore in that molecule. The steric and electronic environment surrounding a structural alert fragment can diminish or enhance its genotoxic potency. The effect of molecular context can render the fragment non-toxic or create a toxic fragment that has not previously been identified. Both of these cases can lead to a false prediction. Recent work has used a phylogenetic tree-like algorithm (Bacha et al., 2002) to reduce the strict dependency of a yes/no rule for the presence/absence of a toxicophore. This method considers similar compounds with shared substructures, which can be either mutagenic or non-mutagenic. In this way, both active and inactive compounds contribute to the rule. It is possible that this technique may serve to partially alleviate the problems that arise in the use of structure alerts. To complicate matters further, genotoxic changes can occur without covalent bond formation with DNA through involvement of the moiety in chromosomal segregation, DNA breakage or interference in the activities of non-DNA targets. In these cases structure alert rules are absent because, to the authors' knowledge, structure alerts for non-DNA targets have not yet been defined. A QSAR model designed to predict genotoxicity correlates the chemical structure, given in terms of continuous or binary molecular descriptors, to a compound's biological activity. Many different types of descriptors have been used in predictive models for Ames bacterial mutagenicity: chemical substructures (Klopman et al., 1984, 1990), topological parameters (Basak et al., 2001; Accelrys Inc., 2003), logP and electronic parameters (King et al., 1996) and various combinations of different classes of descriptors such as geometric, electronic, polar surface area and topological (Mattioni et al., 2003). Most investigators agree that the validity of a QSAR model is based on its ability to predict biological activities for new chemical entities (NCEs) with a high degree of accuracy. A strong predictive ability depends on several factors: (i) the choice of descriptors used; (ii) the learning algorithm or methodology employed in model development; (iii) selection of the compounds used in the learning or training process. Lack of quality in any of the three factors usually translates into an unsuitable predictor for NCEs. Selection of a meaningful and statistically significant set of descriptors, those that correlate well with activity, is accomplished by using various stochastic sampling techniques: a genetic algorithm (Hopfinger and Patel, 1996) or simulated annealing (Kirkpatrick et al., 1983; Sutter et al., 1995) in conjunction with a correlation method, e.g. multiple linear regression (MLR). Algorithms to build predictive genotoxic models have utilized MLR, partial least squares (PLS), hierarchical QSAR and inductive logistic programming (ILP). These algorithms usually assume the existence of a linear relationship between biological activity and the molecular descriptors employed. However, given the quantity and variety of mechanisms involved in toxic interactions, the relationship between descriptor value and toxicity will not always be even approximately linear. In order to model nonlinear effects between descriptors and toxic activities, nonlinear QSAR methods make use of artificial neural net (ANN) analysis [using mainly back propagation (Devillers, 1996)] or k-nearest neighbor (kNN-QSAR) (Zheng and Tropsha, 2000) algorithms]. At this time we are unaware of any ANN or kNN QSAR model for genotoxicity having been reported for a large dataset of mutagens and non-mutagens. Structural and chemical diversity among members of a large dataset is a key consideration in the development of several new QSAR models for Ames genotoxicity in this study. Approximately 3400 chemically diverse compounds were used in this work. Approximately 3000 compounds served as the training/test set for model development and 400 compounds were used for external validation testing. The dataset included 329 therapeutic drugs, with 39 drugs in the 400 compound validation set. In model development the same set of topological descriptors (Hall and Kier, 1995; Kier and Hall, 1999) along with new proprietary descriptors (ChemSilico LLC) were used in all models. These structure variables include descriptors that encode whole molecule structure information as well as atom level descriptors that encode the electronic character of each atom, along with the topological environment and electronic influence of all other atoms in the molecule. Three different QSAR models were developed using three separate modeling techniques: ANN-QSAR, kNN-QSAR and Decision Forest (DF), an advanced tree-based classification algorithm. Materials and methods Data sources for compounds and their selection Data on bacterial mutagenicity for strains of Salmonella typhimurium were obtained from various sources: toxicological data from the National Library of Medicine's Toxicology Data Network (TOXNET, http://toxnet.nlm.nih.gov), Gold and Zeiger (1996), the Register of Toxic Effects of Chemical Substances and the Physicians Desk Reference, version 6.0a (2003). All drugs selected from the Physicians Desk Reference were later compared with those selected in a review of genotoxicity of pharmaceuticals (Snyder and Green, 2001) and found to be in complete agreement in terms of being an Ames mutagen or non-mutagen, with the exception of theophylline. Theophylline was treated as a non-mutagen since only mutagenicity data without metabolic activation by S9 rat liver microsomal preparation for S.typhimurium strains TA97, TA98, TA100, TA102, TA104, TA1535 and TA1537 were used. Compounds did not need to be tested on all seven strains to be included in the data set; a minimum of two strains was used. A number of additional preprocessing steps were performed on the data, such as the removal of duplicate structures and removal of compounds with conflicting reported mutagenicity values. For tracking and labeling purposes, we designated an active (mutagenic) or inactive (non-mutagenic) compound by a mutagenic index: MI = 0 for a non-mutagen and MI = 1 for a mutagen. A total of 3393 compounds were selected with a 60:40 ratio of mutagens to non-mutagens. Compound selection for the training/test and the external validation sets was random, with 12% of the initial compounds set aside for external validation testing. The structures and experimental Ames genotoxicity values for the compounds in the external validation set did not contribute to either descriptor selection or model development for the three methods examined in this investigation. Chemical structures for all compounds in the dataset were entered in Mol file format using structures from the TOXNET ChemPlus database (http://chem.sis.nlm.nih.gov/chemidplus/cmplxqry.html) and ChemDraw version 7.0. The methods employed in this study do not require 3-dimensional atom coordinates or an optimized minimum energy conformation for the input of structures. Molecular descriptors An initial set of 542 topological descriptor indices was computed by ChemSilico software (ChemSilico LLC, Tewksbury, MA) and reduced to a set of 148, using the criterion that at least 5% of the descriptor values must be non-constant (non-zero in most cases) for the 2963 compounds in the training/test set. The indices include molecular connectivity χ indices, atom-type, group-type, bond-type and single atom E-state and hydrogen E-state descriptors, κ shape indices, several binary indicators (e.g. presence of aromatic ring), topological polarity and others. The selected set of 148 descriptors was then further reduced in separate variable selection routines used by the three modeling algorithms in this study. Model development Model developed by ANN analysis ANN analysis was performed on 88% of the dataset (training/test), with 12% set aside for external validation. The training/test set, designated the principal set, was randomly split into 85% for training and 15% as a selection set for early stopping of the learning process to avoid over-fitting. The training set was selected randomly 10 times, forming 10 data sets using 75% of the training set, and a mutually exclusive 10% withholding set in which each compound appears only once in each withheld set. This multiple selection process produces a set of 10 models derived from the principal set using 10 mutually exclusive withholding sets. Using this approach, the non-contributory variables are pruned to give an optimal subset of significant variables. The initial starting set of 148 descriptors was reduced to 38 in the ANN analysis. The relative importance of each eliminated variable is based on its contribution across the entire training/withholding sets by calculation of r2 in each instance when the row (compound) appears in the withholding set. This value is designated q2, i.e. the r2 value for all instances in which the data were withheld from the modeling process. Since q2 is used to select the variables, it does not provide a reliable assessment of the predictive accuracy of the overall algorithm. This task is reserved for an external validation set. A standard back-propagation neural network was used for this study. The network contained no more than nine hidden neurons and utilized the backward elimination approach (Devillers, 1996; Miller, 2002), which has been adapted from traditional linear regression methods. The 10-fold cross-validation algorithm is used as a consensus model in which the average value of 10 neural nets gives the predicted MI value in the range 0–1 for the Ames bacterial mutagenicity of a compound. A threshold value of 0.5 was used in which threshold ≥0.5 means the compound is Ames positive, i.e. a mutagen (MI = 1). A possible standard for the identification of predictions that are potentially less reliable has arisen from examination of the standard deviation (SD) associated with the mean value from 10 ANN models that form the consensus. Predictions with a standard deviation value >0.27 were found to give a false prediction more often than those with a smaller standard deviation. Out of 2552 compounds with SD ≤ 0.27, only 226 compounds had incorrectly predicted MI values. Of the 411 remaining compounds with SD > 0.27, 306 were incorrectly predicted. For this reason, predictions with an associated SD >0.27 are flagged as having a lower reliability. Ranking of descriptors with respect to their importance was determined as the ratio of the difference in sum of squares of residuals (RSS) in the presence and absence of the variable divided by the smallest difference for the least important variable in the training/test set, using an average RSS value from all 10 ANN models. The model produced by this ANN procedure is commercially available in the ChemSilico Predict software (ChemSilico LLC). Model developed by Decision Forest Decision Forest is an algorithm (Tong et al., 2003) that creates and combines multiple heterogeneous but comparable decision tree models to achieve an improved prediction outcome. Development of a forest consists of four steps: (1) develop a qualified tree; (2) develop the next qualified tree based on only those descriptors not used in the previous tree(s); (3) repeat steps 1 and 2 until no additional qualified trees can be developed; (4) classify the activity of a compound based on the mean value of the predicted activities from four trees. Every tree uses a distinct set of descriptors; 144 descriptors were used among all trees. Decision Forest assigns a probability value to each prediction for which chemical compounds with probability values >0.5 are designated as mutagens (MI = 1); others are classified as non-mutagens. The algorithm assigns a confidence level to every prediction, which is defined as high confidence when the predicted value lies between 0.0 and 0.3 or 0.7 and 1.0, whereas the range from 0.30 to 0.70 is considered to be a low confidence region. This confidence value is used as a reliability criterion for the Decision Forest method. Ranking of variables was done by determining the frequency of each descriptor in a 10-fold cross-validation procedure that was repeated 50 times, using 10 training/test sets. Model developed by kNN-QSAR kNN-QSAR models are developed by an advanced nonlinear, non-parametric, variable selection technique that assigns predictions based on a test compound's similarity to training compounds with known activities (Zheng, 2000). Model development starts with use of a sphere exclusion algorithm (Golbraikh and Tropsha, 2002) to split the training data for 2963 compounds into 10 distinct training and test sets for model building and validation. Each training set consists of 50–80% of the available compounds; the remaining compounds are placed in an external test set to validate the models. To build classification models of genotoxicity, a recently developed method was implemented (the RWkNN Classification method), which evolved from an earlier original implementation (Zheng and Tropsha, 2000). This new approach is much faster for large datasets than the previous implementation. RWkNN Classification represents training set compounds as points in a multidimensional descriptor space and uses the distance between points to measure similarity between compounds. The descriptor space sought by the modeling procedure is a subset of the entire descriptor space, optimized by a simulated annealing-based sampling method. The central idea of this method assumes that compounds with similar activities are positioned relative to one another such that descriptor similarity correlates with activity similarity. The criterion used during descriptor space optimization is the ability of the space to achieve the above correlation. To do this, an optimal value of k is identified such that each compound in the training set can be best predicted by the k-most similar compounds (out of the remaining training set). The resulting model consists of an activity-relevant descriptor subspace, a defined k and training set compounds with known activities used to populate this descriptor subspace. An external test compound is predicted by identifying its location in the model descriptor space and then the k-most similar training set compounds are used to predict its class (toxic or not toxic) in a similarity-weighted fashion. Since the procedure as implemented requires a prespecified number of descriptors in the optimized subspace and there is no foreknowledge of what this number should be, a stochastic search was done within a specific range of values known to work well for kNN. In this study, descriptor numbers from 10 to 20 with an interval of 2 were searched, which resulted in 6 distinct values for the number of descriptors. At this point there are 60 differing sets of input parameters to search (10 training/test sets × 6 descriptor numbers). Three models were built for every set of input parameters, yielding 180 models in total. Each model was validated by its ability to predict its corresponding external test set not available during model building. Of the 180 models generated, 140 were capable of predicting a test set with greater than 75% accuracy. These 140 validated models were then used to predict the 400 blind external test compounds in a consensus fashion. The predictions made for each blind compound were collected and any compound that was not consensually predicted to belong to one specific class by at least 65% of the models was excluded; the prediction in this case was considered unreliable. The ability or inability to assign blind compounds to at least 65% of models was used as a reliability criterion for the kNN method. Ranking of variables was done by the frequency of occurrence of each descriptor in all 140 models. Dataset characterization The structural and chemical diversity of the 2963 compounds in the training dataset is illustrated in Table I. Compounds containing ring systems account for 80% of these compounds. N-heteroaromatic rings are present in 16% of the training set, with 64% of these compounds being mutagens. Non-heteroaromatic rings, fused or unfused, make up 71% of the training set, with 9% of all the training compounds having either unsaturated or partially saturated ring(s). Amines attached directly to aromatic rings (single or polycyclic) are found in 30% of compounds; 76% of these amine-bearing compounds are mutagens. Compounds in the training set have an average of six rotatable bonds per compound. Compounds containing halogens and nitro groups make up 20 and 26%, respectively, of the training set. Therapeutic drugs account for ∼10% of the 3363 compounds in the dataset. Only 12% of the drugs are Ames positive. Very similar percentages of compounds with these attributes are found among the 400 compounds in the external validation set, as shown in Table I. Table I. Compound attributes in train and validate sets Structure attribute . Train No.a . Ames-negative . Ames-positive . Validate No. . Ames-negative . Ames positive . Ring(s)b 2371 807 1574 314 103 211 N-heteroaromatic ring(s)c 462 155 307 58 19 39 Aromatic ring(s)d 2100 681 1419 283 87 196 Non-aromatic ring(s) 271 135 136 31 18 13 Ar-NH2e 417 77 340 54 13 41 AR-NHR 245 78 167 32 9 23 Ar-NH2 235 61 174 30 9 21 f 5.6 1127 1676 5.2 155 223 g 1.1 686 1024 1.1 92 124 3.8 1108 1661 3.7 152 222 Halogen(s)h 590 247 343 85 34 51 Amine(s)i 1383 485 898 179 68 111 −N= 370 50 320 62 15 47 −NO2j 284 36 248 47 4 43 243 244 242 238 226 245 Therapeutic drugs 290 255 35 39 37 2 Cpds Non-mutagens Mutagens cpds Non-mutagens Mutagens Total 2963 1161 1802 400 161 239 Structure attribute . Train No.a . Ames-negative . Ames-positive . Validate No. . Ames-negative . Ames positive . Ring(s)b 2371 807 1574 314 103 211 N-heteroaromatic ring(s)c 462 155 307 58 19 39 Aromatic ring(s)d 2100 681 1419 283 87 196 Non-aromatic ring(s) 271 135 136 31 18 13 Ar-NH2e 417 77 340 54 13 41 AR-NHR 245 78 167 32 9 23 Ar-NH2 235 61 174 30 9 21 f 5.6 1127 1676 5.2 155 223 g 1.1 686 1024 1.1 92 124 3.8 1108 1661 3.7 152 222 Halogen(s)h 590 247 343 85 34 51 Amine(s)i 1383 485 898 179 68 111 −N= 370 50 320 62 15 47 −NO2j 284 36 248 47 4 43 243 244 242 238 226 245 Therapeutic drugs 290 255 35 39 37 2 Cpds Non-mutagens Mutagens cpds Non-mutagens Mutagens Total 2963 1161 1802 400 161 239 a Number of compounds with given attribute or its average. Train means compounds with both train and test sets. b Ring(s): number of compounds with one or more rings. c N-heteroaromatic ring(s): number of compounds containing one or more such rings. d Aromatic ring(s): number of all compounds with a heteroaromatic or non-heteroaromatic ring system. e Ar-NH2, −NHR, −NR2: number of compounds with amines attached directly to an aromatic ring system. f : average number of rotatable bonds per compound. g : average number of H-bonds donors or acceptors per compound. h Halogen(s): number of compounds with one or more F, Cl, Br or I substituents. i Amine(s): number of compounds containing one or more −NH2, −NHR or NR2 groups. j NO2: number of compounds with one or more nitro groups. Open in new tab Table I. Compound attributes in train and validate sets Structure attribute . Train No.a . Ames-negative . Ames-positive . Validate No. . Ames-negative . Ames positive . Ring(s)b 2371 807 1574 314 103 211 N-heteroaromatic ring(s)c 462 155 307 58 19 39 Aromatic ring(s)d 2100 681 1419 283 87 196 Non-aromatic ring(s) 271 135 136 31 18 13 Ar-NH2e 417 77 340 54 13 41 AR-NHR 245 78 167 32 9 23 Ar-NH2 235 61 174 30 9 21 f 5.6 1127 1676 5.2 155 223 g 1.1 686 1024 1.1 92 124 3.8 1108 1661 3.7 152 222 Halogen(s)h 590 247 343 85 34 51 Amine(s)i 1383 485 898 179 68 111 −N= 370 50 320 62 15 47 −NO2j 284 36 248 47 4 43 243 244 242 238 226 245 Therapeutic drugs 290 255 35 39 37 2 Cpds Non-mutagens Mutagens cpds Non-mutagens Mutagens Total 2963 1161 1802 400 161 239 Structure attribute . Train No.a . Ames-negative . Ames-positive . Validate No. . Ames-negative . Ames positive . Ring(s)b 2371 807 1574 314 103 211 N-heteroaromatic ring(s)c 462 155 307 58 19 39 Aromatic ring(s)d 2100 681 1419 283 87 196 Non-aromatic ring(s) 271 135 136 31 18 13 Ar-NH2e 417 77 340 54 13 41 AR-NHR 245 78 167 32 9 23 Ar-NH2 235 61 174 30 9 21 f 5.6 1127 1676 5.2 155 223 g 1.1 686 1024 1.1 92 124 3.8 1108 1661 3.7 152 222 Halogen(s)h 590 247 343 85 34 51 Amine(s)i 1383 485 898 179 68 111 −N= 370 50 320 62 15 47 −NO2j 284 36 248 47 4 43 243 244 242 238 226 245 Therapeutic drugs 290 255 35 39 37 2 Cpds Non-mutagens Mutagens cpds Non-mutagens Mutagens Total 2963 1161 1802 400 161 239 a Number of compounds with given attribute or its average. Train means compounds with both train and test sets. b Ring(s): number of compounds with one or more rings. c N-heteroaromatic ring(s): number of compounds containing one or more such rings. d Aromatic ring(s): number of all compounds with a heteroaromatic or non-heteroaromatic ring system. e Ar-NH2, −NHR, −NR2: number of compounds with amines attached directly to an aromatic ring system. f : average number of rotatable bonds per compound. g : average number of H-bonds donors or acceptors per compound. h Halogen(s): number of compounds with one or more F, Cl, Br or I substituents. i Amine(s): number of compounds containing one or more −NH2, −NHR or NR2 groups. j NO2: number of compounds with one or more nitro groups. Open in new tab Results Statistical information Statistical parameters for the results from three predictive models, ANN-QSAR, DF and kNN-QSAR, are given in Table II for the training and validation sets as well as results on therapeutic drugs. All models output a continuous value between 0 and 1 for the predicted genotoxicity index, MI. An arbitrary threshold of 0.5 was used by the ANN, kNN and DF models: MI = 1 (mutagen) when threshold ≥0.5. To check whether this threshold value gives high sensitivity and specificity, areas under receiver operator characteristic (ROC) curves were computed (Analyse-It Software Ltd, 2003) for each model. A typical curve is shown in Figure 1 for the ANN-QSAR training dataset. A model with no predictive ability yields the diagonal line. The DF and kNN models show almost identical curves to that in Figure 1 with values of 0.94 and 0.95, respectively, as shown in Table II for the training set. The closer the area under the ROC curve is to 1, the greater is the predictive ability of the model. Small changes in the threshold value (±0.05) had no significant impact in all three models, but larger changes showed a significant change in ROC values. Use of this 0.5 threshold value gave average sensitivity and specificity values of 86 and 90%, respectively, for the three QSAR consensus models applied to a training set of 2963 compounds without implementing reliability threshold values. Fig. 1. Open in new tabDownload slide Receiver operator curve (ROC) for the ANN model of the 2963 training compounds. The dotted diagonal line represents outcomes when a model has no discriminatory power in predicting binary outcomes. Fig. 1. Open in new tabDownload slide Receiver operator curve (ROC) for the ANN model of the 2963 training compounds. The dotted diagonal line represents outcomes when a model has no discriminatory power in predicting binary outcomes. Table II. Statistical parameters on prediction of Ames genotoxicity from three QSAR models, Artificial Neural Network (ANN), k Nearest Neighbors (kNN) and the classification method Decision Forest (DF) Model . N . M . ROC . Concordance (train) . %F Pos (train) . %F Neg (train) . N . ROC . Concordance (validate) . %F Pos (validate) . %F Neg (validate) . ANN 2963 38 0.96 89% 2% 9% 400 (319) 0.93 84% (91%) 5% (2%) 11% (7%) DF 2963 144 0.94 90% 3% 8% 400 (244) 0.91 82% (93%) 7% (2%) 11% (5%) kNN 2963 51 0.95 84% 7% 9% 400 (300) 0.92 84% (92%) 8% (3%) 9% (5%) Therapeutic drugs Therapeutic drugs ANN 290 38 0.95 93% 4% 3% 39 (34) 92% (97%) 8% (3%) 0% (0%) DF 290 144 0.94 93% 5% 2% 39 (16) 83% (88%) 13% (12%) 4% (0%) kNN 290 51 0.88 90% 6% 4% 39 (26) 82% (96%) 18% (4%) 0% (0%) Model . N . M . ROC . Concordance (train) . %F Pos (train) . %F Neg (train) . N . ROC . Concordance (validate) . %F Pos (validate) . %F Neg (validate) . ANN 2963 38 0.96 89% 2% 9% 400 (319) 0.93 84% (91%) 5% (2%) 11% (7%) DF 2963 144 0.94 90% 3% 8% 400 (244) 0.91 82% (93%) 7% (2%) 11% (5%) kNN 2963 51 0.95 84% 7% 9% 400 (300) 0.92 84% (92%) 8% (3%) 9% (5%) Therapeutic drugs Therapeutic drugs ANN 290 38 0.95 93% 4% 3% 39 (34) 92% (97%) 8% (3%) 0% (0%) DF 290 144 0.94 93% 5% 2% 39 (16) 83% (88%) 13% (12%) 4% (0%) kNN 290 51 0.88 90% 6% 4% 39 (26) 82% (96%) 18% (4%) 0% (0%) Data set split into 2963 compounds for ratin/test and 400 validation testing. Values in parentheses are results from each concensus model through use of the model's threshold index. N: number of compounds; M: number of descriptors used in the concensus model; ROC: area under the receiver operator characteristic curve. Thirty-nine threshold compound data set too small to obtain reliable ROC measurements; %F Pos and %F Neg: percentages of Ames false positives and negatives. Open in new tab Table II. Statistical parameters on prediction of Ames genotoxicity from three QSAR models, Artificial Neural Network (ANN), k Nearest Neighbors (kNN) and the classification method Decision Forest (DF) Model . N . M . ROC . Concordance (train) . %F Pos (train) . %F Neg (train) . N . ROC . Concordance (validate) . %F Pos (validate) . %F Neg (validate) . ANN 2963 38 0.96 89% 2% 9% 400 (319) 0.93 84% (91%) 5% (2%) 11% (7%) DF 2963 144 0.94 90% 3% 8% 400 (244) 0.91 82% (93%) 7% (2%) 11% (5%) kNN 2963 51 0.95 84% 7% 9% 400 (300) 0.92 84% (92%) 8% (3%) 9% (5%) Therapeutic drugs Therapeutic drugs ANN 290 38 0.95 93% 4% 3% 39 (34) 92% (97%) 8% (3%) 0% (0%) DF 290 144 0.94 93% 5% 2% 39 (16) 83% (88%) 13% (12%) 4% (0%) kNN 290 51 0.88 90% 6% 4% 39 (26) 82% (96%) 18% (4%) 0% (0%) Model . N . M . ROC . Concordance (train) . %F Pos (train) . %F Neg (train) . N . ROC . Concordance (validate) . %F Pos (validate) . %F Neg (validate) . ANN 2963 38 0.96 89% 2% 9% 400 (319) 0.93 84% (91%) 5% (2%) 11% (7%) DF 2963 144 0.94 90% 3% 8% 400 (244) 0.91 82% (93%) 7% (2%) 11% (5%) kNN 2963 51 0.95 84% 7% 9% 400 (300) 0.92 84% (92%) 8% (3%) 9% (5%) Therapeutic drugs Therapeutic drugs ANN 290 38 0.95 93% 4% 3% 39 (34) 92% (97%) 8% (3%) 0% (0%) DF 290 144 0.94 93% 5% 2% 39 (16) 83% (88%) 13% (12%) 4% (0%) kNN 290 51 0.88 90% 6% 4% 39 (26) 82% (96%) 18% (4%) 0% (0%) Data set split into 2963 compounds for ratin/test and 400 validation testing. Values in parentheses are results from each concensus model through use of the model's threshold index. N: number of compounds; M: number of descriptors used in the concensus model; ROC: area under the receiver operator characteristic curve. Thirty-nine threshold compound data set too small to obtain reliable ROC measurements; %F Pos and %F Neg: percentages of Ames false positives and negatives. Open in new tab External validation information Each of the learning algorithms investigated in this study produced a consensus-based model with a solid statistical performance for external validation. This quality result is of interest, given the diverse methods involved in generation of the three models. The evaluation of these models is based on the overall concordance for both training and external validation compounds. The average values for the three models are 88 ± 3% (NN 89%, DF 90% and kNN 84%) and 83 ± 1% (NN 84%, DF 82% and kNN 84%), respectively, for the training and external validation sets. Averages were calculated without the exclusion of outcomes estimated to be less reliable. Each of the three methods employed a reliability criterion designed to identify outcomes estimated to be less reliable. Statistics are given in support of this approach for two of the methods evaluated. The first example is cited in the ANN methods section where the standard deviation across the prediction of 10 separate networks is used as a reliability criterion. MI values associated with SD > 0.27 are considered unreliable or suspect. The second example of such an improvement is given for the kNN method and is illustrated in Figure 2. Broadening the region of descriptor space for which kNN predictions are made (Z-score) produces a marked decrease in overall external test set accuracy. Fig. 2. Open in new tabDownload slide Comparison of test set percent accuracy for kNN as a function of Z-score. The Z-score value regulates the size of the model applicability domain, therefore, a smaller value forces the model to only predict test compounds with high similarity to those in the training set. Higher similarity to training set compounds implies higher external prediction confidence. Fig. 2. Open in new tabDownload slide Comparison of test set percent accuracy for kNN as a function of Z-score. The Z-score value regulates the size of the model applicability domain, therefore, a smaller value forces the model to only predict test compounds with high similarity to those in the training set. Higher similarity to training set compounds implies higher external prediction confidence. Identifying potentially unreliable predictions is considered an important feature for commercial considerations, where it is an advantage to be able to inform the user that a given prediction may be less reliable. When compounds are excluded from the validation set on the basis of the reliability criterion, the average concordance values for all models increases from 83 to 92% for the data shown in Table II. In the case of the ANN consensus model, its sensitivity increased from 82 to 88% and specificity from 88 to 94%. Likewise, with the kNN consensus model the sensitivity rose from 86 to 92% with a simultaneous increase in specificity from 81 to 92%. Concomitantly, the percentage of Ames false negatives with the ANN model decreased from 11 to 7%, with a smaller decline, 5 to 2%, for false positives. With the kNN models similar changes occurred; 11 to 5% for false negatives and 8 to 6% for false positives. The DF models showed the most change when the threshold was implemented, resulting in a drop from 11 to 5% in false negatives and from 7 to 2% for false positives. These changes reflected a sensitivity rise from 82 to 93% and specificity increase from 81 to 94%. Application of the respective reliability criterion for the ANN model indicated that ∼20% of the predictions should be flagged as less reliable, with ∼30% for the kNN model. The reliability criterion used for the model developed by the DF method indicated that ∼40% of its predictions should be considered less reliable. The three models differ somewhat in their ability to predict genotoxicity for therapeutic drugs. Among the models, an average concordance of 92 ± 2% was found for 290 drugs in the training set with ROC values of 0.95, 0.94 and 0.88 (Table II) for the ANN, DF and kNN models, respectively. It was impossible to assess the reliability of the ROC values for QSAR models for 39 validation drugs due to the small population size. This small dataset led to large fluctuations (0.2–0.4) in ROC value with small changes (<0.15) in the continuous MI outcomes when tested for stability. Moreover, the 39 drugs contained only two mutagens, hence, they are adequate to evaluate the specificity but not the sensitivity of the models. The ANN-QSAR model performs very well, yielding a 92% concordance value for all 39 drugs with 5% or less for false positives and negatives. The prediction rate increases to 97% (33/34 predicted correctly) after applying the reliability criterion to remove outcomes suspected of being less reliable. The specificity before and after applying the reliability threshold increases from 92 to 97%, as would be expected with so few mutagens. Statistics for the classification-based DF model changed from 83 to 88%, respectively, for concordance (essentially specificity in this case) with a minor reduction, 13 to 12%, in false positives. However, the 88% value followed removal of 60% of the drugs due to application of the reliability criterion (14 of 16 being correctly assigned as mutagen or non-mutagens after removal of less reliable predictions). The kNN-QSAR model had the most pronounced change, with an increase from 82 to 96% concordance with a simultaneous drop from 18 to 4% in false positives, with removal of 12 drugs (31% reduction in number of drugs) suspected of being unreliably predicted. The specificity increased from 81 to 96% upon removal of these 12 compounds. Since the concordance for both the kNN and DF models on the training set (290 drugs) is very good, it is necessary to address the possibility that the difference in external validation performance among the models may arise from the random selection of the 39 drugs used for external validation. In other words, it is necessary to confirm that the external validation compounds are representative of the training set in terms of descriptors used by each consensus model. Three hierarchical clustering procedures (Ward's method), each composed of 10 clusters (data not given), were run on the total dataset of 3363 compounds, employing the same descriptor values used in each consensus model. All the drugs are well represented in six or seven of the clusters, with no drug singletons present. The average percentage of validation drugs (as a percentage of total drugs in these clusters) was reasonably constant: 13, 14 and 15% for the kNN, DF and ANN models, respectively. Therefore, it is reasonable to state that the 39 drugs in the validation set were well represented in the training set used by all models. Structure descriptors Table III lists the top 10 ranked descriptors found in each consensus model along with their frequency of occurrence in the training dataset. It is of interest, given the different natures of the learning algorithms and the descriptor selection processes, that four of the top 10 ranked variables in any one model are common to one or two other models. These descriptors are shown in bold in Table III, along with their frequencies (see Table IV for descriptor definitions). Gmin and SdsN are found in all models. Four of the important descriptors (Gmin, Hmax, Gmax and Qv) in the ANN and DF models are what may be called global topological indices, as indicated by their frequency values: there exists a computable value for any organic molecule. The frequencies of different topological descriptors used by the three models are given in Figure 3. The topological descriptors were subdivided into classes: molecular connectivity χ indices, bond-type and atom-type E-state descriptors and hydrogen E-state descriptors and others, including molecular weight, number of rotatable bonds and number of hydrogen donors and acceptors. DF favored use of both atom-type E-state descriptors and molecular connectivity indices; 62% of all variables were in these two classes. For the ANN model, 19 and 37 descriptors are in common with the kNN and DF models, respectively, from among these classes. The percentages of descriptors used by the ANN and kNN models differed among the various classes, with the largest disparity being in the number of atom-type E-state and hydrogen E-state indices. Members of these classes accounted for 63% of the 38 descriptors used in the ANN models, as opposed to 47% of the 51 variables used by kNN-QSAR. Table III. The 10 most important descriptors in ANN, kNN and DF models, trend of ANN descriptors and frequency of variables in training compounds. Descriptors in bold were found to be important by more than one modeling technique. Descriptor definitions are given in Table IV . ANN model . . . DF model . . kNN model . . Ranka . Variable . Trendb . Frequencyc . Variable . Frequency . Variable . Frequency . 1 ArNH21 + 417 SdsN 370 SaasC 417 2 Hmax + 2963 SsssN 610 SdsN 370 3 ArHNH21 + 245 Qv 2963 ArN1 240 4 Gmax variable 2963 Gmax 2963 SddsN 264 5 SHBint2 variable 665 phia 2963 e1N2N3d 146 6 Qv − 2963 1Kα 2963 Gmin 2963 7 eaC2C3s + 1986 e2C3O1s 1103 SsCl 400 8 SdsN + 370 SHBa 2777 e2C3O1s 1103 9 Gmin + 2963 Gmin 2963 SHHBd 1710 10 SddsN + 284 SHCsats 1676 e2N3O1s 284 . ANN model . . . DF model . . kNN model . . Ranka . Variable . Trendb . Frequencyc . Variable . Frequency . Variable . Frequency . 1 ArNH21 + 417 SdsN 370 SaasC 417 2 Hmax + 2963 SsssN 610 SdsN 370 3 ArHNH21 + 245 Qv 2963 ArN1 240 4 Gmax variable 2963 Gmax 2963 SddsN 264 5 SHBint2 variable 665 phia 2963 e1N2N3d 146 6 Qv − 2963 1Kα 2963 Gmin 2963 7 eaC2C3s + 1986 e2C3O1s 1103 SsCl 400 8 SdsN + 370 SHBa 2777 e2C3O1s 1103 9 Gmin + 2963 Gmin 2963 SHHBd 1710 10 SddsN + 284 SHCsats 1676 e2N3O1s 284 a Rank for ANN model determined as the ratio of the difference in RSS (sum of squares of residues) in the presence and absence of the variable divided by the same difference for the least important variable averaged across all 10 ANN models. DF ranked descriptors determined by the frequency each descriptor was used in 50 times 10-fold cross validations. kNN rank is the number of times each descriptor occurred in 140 models. b A positive trend indicates that the Ames bacterial mutagenicity will tend to increase as the descriptor value increases. A negative trend indicates the inverse effect. A varible trend is dependent on the range of the descriptor value or interdependent on the value of other descriptors. The trend is a mean over the 10 ANN models. c Frequency is the number of compounds in the train/test sets expressing the topological descriptor. Open in new tab Table III. The 10 most important descriptors in ANN, kNN and DF models, trend of ANN descriptors and frequency of variables in training compounds. Descriptors in bold were found to be important by more than one modeling technique. Descriptor definitions are given in Table IV . ANN model . . . DF model . . kNN model . . Ranka . Variable . Trendb . Frequencyc . Variable . Frequency . Variable . Frequency . 1 ArNH21 + 417 SdsN 370 SaasC 417 2 Hmax + 2963 SsssN 610 SdsN 370 3 ArHNH21 + 245 Qv 2963 ArN1 240 4 Gmax variable 2963 Gmax 2963 SddsN 264 5 SHBint2 variable 665 phia 2963 e1N2N3d 146 6 Qv − 2963 1Kα 2963 Gmin 2963 7 eaC2C3s + 1986 e2C3O1s 1103 SsCl 400 8 SdsN + 370 SHBa 2777 e2C3O1s 1103 9 Gmin + 2963 Gmin 2963 SHHBd 1710 10 SddsN + 284 SHCsats 1676 e2N3O1s 284 . ANN model . . . DF model . . kNN model . . Ranka . Variable . Trendb . Frequencyc . Variable . Frequency . Variable . Frequency . 1 ArNH21 + 417 SdsN 370 SaasC 417 2 Hmax + 2963 SsssN 610 SdsN 370 3 ArHNH21 + 245 Qv 2963 ArN1 240 4 Gmax variable 2963 Gmax 2963 SddsN 264 5 SHBint2 variable 665 phia 2963 e1N2N3d 146 6 Qv − 2963 1Kα 2963 Gmin 2963 7 eaC2C3s + 1986 e2C3O1s 1103 SsCl 400 8 SdsN + 370 SHBa 2777 e2C3O1s 1103 9 Gmin + 2963 Gmin 2963 SHHBd 1710 10 SddsN + 284 SHCsats 1676 e2N3O1s 284 a Rank for ANN model determined as the ratio of the difference in RSS (sum of squares of residues) in the presence and absence of the variable divided by the same difference for the least important variable averaged across all 10 ANN models. DF ranked descriptors determined by the frequency each descriptor was used in 50 times 10-fold cross validations. kNN rank is the number of times each descriptor occurred in 140 models. b A positive trend indicates that the Ames bacterial mutagenicity will tend to increase as the descriptor value increases. A negative trend indicates the inverse effect. A varible trend is dependent on the range of the descriptor value or interdependent on the value of other descriptors. The trend is a mean over the 10 ANN models. c Frequency is the number of compounds in the train/test sets expressing the topological descriptor. Open in new tab Table IV. Summary definitions of all descriptors that were found to be in the top 10 in importance by the three modeling methods evaluated Open in new tab Table IV. Summary definitions of all descriptors that were found to be in the top 10 in importance by the three modeling methods evaluated Open in new tab Fig. 3. Open in new tabDownload slide Frequency of topological descriptors subdivided into classes. Black, gray and white rectangles are the KNN, DF and ANN models, respectively. Other includes descriptors such as molecular weight, flexibility and the number of rotational bonds. Fig. 3. Open in new tabDownload slide Frequency of topological descriptors subdivided into classes. Black, gray and white rectangles are the KNN, DF and ANN models, respectively. Other includes descriptors such as molecular weight, flexibility and the number of rotational bonds. Discussion Several studies on toxicity, including benzene toxicity (Hall et al., 1989a,b), phenol toxicity (Hall and Vaughn, 1997), toxicity to fish (Rose and Hall, 2003) and amide herbicide toxicity (Gough and Hall, 1999), have employed only topological indices in modeling structure–activity relationships (QSARs), as is done in this present study. Because calculated logP has been shown to be a useful parameter in some previous QSAR studies on toxicity, it was included in the 148 descriptors that were considered by each of the modeling methods. The three modeling methods employed in this study did not select logP to be included in the best model. As an individual descriptor logP showed a negligible correlation (r2 = 0.02) with the Ames mutagenicity for the compounds in the training set. Recently, a similar lack of correlation has been reported in another QSAR genotoxicity study (Mattioni et al., 2003). The models of Ames genotoxicity presented in this investigation perform very well in the prediction of external validation compounds. Based on the authors experience with these and other models, they contend that one of the critical determining factors for this success arises from the nature of the molecular structure representation employed in the model development process. Topological structure descriptors have been used in all these models. Three aspects of this approach form the basis for the strong models. First, whole molecular descriptors (usually molecular connectivity χ and κ shape indices) encode significant information on the general features of the molecules. Whole molecule descriptors encode information on general structural features such as molecular size and shape, as well as specific information on skeletal variations and complexity. These structural features are expected to have a relationship to properties arising from intermolecular interactions and may also function to provide discrimination among multiple structural classes. Second, atom-type, group-type, bond-type and single-atom E-state descriptors encode information on specific molecular features such as atom and bond types associated with important functional groups. For all three genotoxicity models some of the E-state descriptors found to be important relate directly to known structural alerts for mutagenicity (see Figure 4). Each of these E-state descriptors encodes three aspects of structure: electron accessibility at that atom or bond, the presence/absence of that feature and, finally, the count of that feature. We emphasize that these descriptors are not counts; they encode both the influence of the molecular context as well as the count. Because the E-state is a continuous value, a model based on E-state descriptors can correlate genotoxicity with a specific range of descriptor value, whereas the use of fragment-based structure alerts limits the model to a correlation with the presence/absence or simple count of a given fragment. As was stated in the Introduction, the practice of structure alerts has been demonstrated to lead to false predictions for this reason. Fig. 4. Open in new tabDownload slide 2-Dimensional structures (A–P). Bold, wide bonds show positions within structures where descriptors indicate a structural alert for Ames mutagenicity as found among the most important E-state indices. The positions of a bond E-state alert, eaC2C3s, and Hmax in several structures are also shown. Fig. 4. Open in new tabDownload slide 2-Dimensional structures (A–P). Bold, wide bonds show positions within structures where descriptors indicate a structural alert for Ames mutagenicity as found among the most important E-state indices. The positions of a bond E-state alert, eaC2C3s, and Hmax in several structures are also shown. The topological modeling approach is said to be mechanism-free, or at least not to be mechanism biased (Adamson and Bawden, 1976; Adamson and Bush, 1976; Hall, 2004). This aspect is very important because it is not necessary to assume any set of mechanistic steps in order to carry out the computation for these complicated biological properties. Further, it is also not necessary to make approximations in order to carry out the computations related to an assumed mechanism, a necessity when dealing with any assumed mechanism of interaction. A second critical determining factor for the success of the models presented in this study is the use of advanced nonlinear modeling and descriptor selection techniques. It is intuitive to assume that the presence of a genotoxic substructure in a molecule does not relate to toxicity in a linear fashion. A compound that contains five toxicophoric centers is not likely to be 5 times more genotoxic that a compound with one such structure. Such relationships are generally asymptotic, with sigmoidal being the most common example. There are many other factors that influence toxicity, such as solubility, the ability to penetrate biological membranes and the potential for catabolism, each of which has a complex relation to structure that may or may not be linear. The authors believe that the ability to thoroughly explore the nonlinear relation between structural characteristics and genotoxicity is essential to the development of a successful model. This includes the use of a variable selection process that can evaluate the potential of a descriptor to show a strong nonlinear correlation with the activity or property under consideration. These factors work together to provide the basis for strong modeling: use of a wide range of topological descriptors, including whole molecule and atom level descriptors, the use of advanced nonlinear descriptor selection and modeling techniques and the resulting development of a direct relationship between structure and activity. Structure interpretation Many of the most important descriptors shown in Table II relate directly to, or are associated with, structural alerts as reported by other investigators (Ashby and Tennant, 1991; King et al., 1996). For the purposes of this discussion, the descriptors will be presented in groupings related to their fundamental nature. The definition of each descriptor is given in Table IV. Atom-type E-state descriptors ArNH21 and ArHNH21 The descriptors ArNH21 and ArHNH21 encode electron accessibility of amino groups attached to aromatic ring systems. These substructures are prevalent among mutagens and non-mutagens, especially polycyclic aromatics (Kubo et al., 2002). According to the model, an increase in the count and/or the electron accessibility of either of these descriptors leads to an increase in the predicted genotoxicity value. SddsN The SddsN index (atom-type E-state descriptor for nitrogen in nitro groups), together with either ArNH21 or ArHNH21, signals common structural alerts for compounds containing either group or a combination of one or more nitro and amino groups. Molecules with larger SddsN values tend to have larger predicted genotoxicity values. SdsN The SdsN descriptor (E-state for the nitrogen atom type −N=), found in all models, is associated with the azo group, a structure alert (see structure A in Figure 4). Molecules with larger SdsN descriptor values tend to have larger calculated output values. SsssN and ArN1 In some cases, several E-state descriptors encode similar information about an atom type involved in a QSAR, one descriptor being more general than the other. Such is the case with SsssN and ArN1. The latter is the atom-type E-state descriptor for a tertiary nitrogen atom attached to an aromatic ring, whereas SsssN is the more general case, the atom-type E-state of all tertiary nitrogens in molecules. Tertiary nitrogen group alerts occur when the nitrogen is attached to either an aromatic or partially unsaturated ring (see structures B–D in Figure 4). The DF model selects SsssN, whereas the kNN model employs ArN1. Bond-type E-state descriptors Of interest is the selection of bond-type E-state descriptors, which encode the electron accessibility at the bond, along with perturbations from all other bonds in the molecule (Kier and Hall, 1999). e2C3O1s The bond-type descriptor e2C3O1s, found in the kNN and DF models, encodes electron accessibility of the carbonyl bond, >C=O, which can be a structure alert or part of an alert (see structures E and F in Figure 4). eaC2C3s The bond type descriptor, eaC2C3s, found in the ANN model, encodes an aromatic C–C bond in which one carbon is attached by a single bond to a substituent group. SaasC The bond type is analogous to the atom-type E-state SaasC in the kNN model, which is the atom-type E-state for aromatic carbons with an attached substituent atom. Neither the atom-type nor the bond-type E-state descriptor stands for an alert per se. However, their E-state values reflect the nature of the structural alert(s) attached to the ring system (see structures D, H and O in Figure 4). The bond type encoded in eaC2C3s was identified (King et al., 1996) as a structural alert for aromatic or heteroaromatic nitro compounds. e1N2N3d Another bond type descriptor, e1N2N3d (sum of E-state values for the >N–N= group), identifies a known structural alert (see structures I and J in Figure 4). e2N3O1s The presence of a nitro group is also reflected in the bond-type E-state e2N3O1s, found in the DF and kNN models. In this bond type the bond encoded is >N = O. Global E-state descriptors Four global descriptors, Hmax, Gmax, Qv and Gmin, are also found to be most important. Hmax Of significant interest is Hmax, since it was ranked number two in the ANN-QSAR and was also found to be of moderate importance in the DF model. It is the maximum hydrogen atom E-state value in a molecule, which signifies the largest polarity on a hydrogen atom in the molecule. It also correlates with partial charge. The Hmax values range from 0 to 3.1 in the training set. Examination of the 100 lowest (0–0.65) and highest (2.8–3.1) Hmax values of compounds in the training set reveals a significant trend with respect to Ames mutagenicity. Twenty-eight percent (28%) of compounds in the low Hmax range are mutagenic, as compared with 79% of compounds with high Hmax values. The presence of a highly polar hydrogen atom does not represent a structural alert in the typical sense of how a toxicophore is defined. Atoms encoded as Hmax tend to be associated with –OH, –NH2 and –NH– groups of acids, amines and amides and substituents attached to aromatic rings (see structures M, N and P). Furthermore, if the molecule does not possess any of these highly polar centers, the atom with the maximum hydrogen E-state value can still be significant. It is not clear whether these polar atoms are involved mechanistically in a non-nucleophilic interaction with DNA or whether Hmax reflects the influence of a nearby toxicophore. Despite the lack of a clear mechanistic implication, the study points to a statistically significant relationship between high Hmax values and Ames genotoxic potency. The same effect has also been associated with high mutagenic potency by King, utilizing electronic descriptors (King et al., 1996). Gmax The E-state descriptor Gmax is the largest E-state value in a molecule, usually associated with the most electronegative atom in the molecule, and there is a strong probability that its selection relates to structural alerts containing such moieties (structures G, K and L in Figure 4) that are adjacent to electrophilic centers. Gmin and Qv The E-state descriptors Gmin and Qv, respectively, are a measure of the most electrophilic atom in the molecule and the polarity of the molecule. Mechanistically, an electrophilic center is important for covalent bond formation with nucleophilic DNA and so it is not surprising that Gmin is found to be important in all three modeling methods. Qv is inversely proportional to the polarity and hydrophilicity of a molecule. The Qv descriptor reflects the presence of heteroatoms and polar functional groups, many of which are related to genotoxic structure alerts. Molecules with smaller values for Qv are more polar and tend to have larger predicted genotoxicity values. Group-type E-state descriptors Three other important descriptors are those for the group-type E-state and hydrogen E-state. SHBa and SHHBd The group-type E-state descriptors SHBa and SHHBd relate to toxicophores that have hydrogen bond acceptor or donor atoms. SHBa is the sum of E-state values for acceptors; SHHBd is the sum of hydrogen E-state values for donors. SHCsats The E-state descriptor SHCsats encodes E-state values for hydrogens on sp3 hybrid carbons bonded only to other sp3 carbon atoms. The electron accessibility of these sp3 hydrogens may relate in some manner to hydrophobic interactions between substrates and DNA or may have a relation to alkyl chlorides that are known toxicophores. Kappa shape descriptors phia and 1κα The κ shape descriptors phia and 1κα were found to be important in the DF consensus model. The 1κα index encodes information about the cyclicity of a molecule where long, straight chain molecules have the highest values and the value decreases with increases in globularity and cyclicity. The phia index relates to molecular flexibility and increases in value with greater molecular flexibility. The phia index is related to the number of rotatable bonds. With the exception of phia and 1κα, all of the important descriptors selected by the three modeling methods are directly related to the atom-type and bond-type E-state descriptors. The molecular connectivity χ indices demonstrate a more limited impact on the predicted genotoxicity value. Connectivity indices account for branching, type and composition of polycyclic aromatic ring systems and the structural patterns of members of the training set. The low order connectivity indices relate to fundamental parameters such as molecular surface area, molecular weight and the presence of heteroatoms. Such descriptors may function to provide discrimination among multiple structural classes. The higher order connectivity indices provide information about more complex skeletal variation and may also perform a classification function. Column three of Table III shows the trend for the 10 most important variables in the ANN model. The trend for seven of the descriptors shows that increased Ames mutagenic potency is associated with an increase in the descriptor magnitude. Two descriptors, SHBint2 and Gmax, showed a variable trend. The topological polarity descriptor Qv shows a negative trend, as molecules with smaller values for Qv tend to have larger predicted genotoxicity values. SHBint2 was selected for all the models with varying degrees of importance. It is the group-type E-state of a hydrogen donor and acceptor separated by two skeletal bonds (e.g. a carboxylic acid, primary or secondary amide or urea). There is no reported relationship between a structure descriptor that encodes SHBint2 information and a structural alert for genotoxicity. An interpretation of the affect that the number, positions and type of ring substituents have on mutagenic potency is an indication of the sensitivity of the three models presented in this study (see illustration in Figure 5). Examination of the first set of four structures in the left column indicates that they all have two or more of the same substituents with a small number of changes in their relative ring positions. Yet, replacement of a sulfonic acid group in a non-mutagen (structure 3 in Figure 5) with the nitro group in structure 4 converts it to a mutagen, which was predicted correctly by all the models. Again, in structures 5–8 the affects of the introduction of a different substituent or the absence of one was correctly predicted by the ANN-QSAR for structures 7 and 8 based on two somewhat similar templates in the training set. The kNN-QSAR and DF models incorrectly predicted genotoxicity for structure 7, indicating somewhat less sensitivity in this case to affects of substituents on potency. Given the diversity in the learning techniques used among these models, some differences in predictability with specific compounds or classes are to be expected. Fig. 5. Open in new tabDownload slide Examples of the sensitivity of ANN-QSAR to the presence and position of substituent groups in determining mutagenic potency of the structures shown in the left and right columns for the training and validation compounds. Similar sensitivities were found for the k-NN and DF models. Fig. 5. Open in new tabDownload slide Examples of the sensitivity of ANN-QSAR to the presence and position of substituent groups in determining mutagenic potency of the structures shown in the left and right columns for the training and validation compounds. Similar sensitivities were found for the k-NN and DF models. Comparison with other methods Two other modeling methods, multilinear and logistical regression, were examined in the course of this investigation. The results are not presented in this study because these methods produced models that proved to be rather poor predictors when applied to the subset of therapeutic drugs in training. When testing drug molecules, the external validation results for these methods showed an outcome that was nearly random. It is, however, of interest to make comparisons with the validation results on genotoxicty predictability obtained with two other models, MULTICASE and DEREK. MULTICASE is a commercial QSAR regression model that uses fragments and statistical rules to identify active and inactive fragments, while DEREK is a strictly rule-based commercial program. Our 400 compound validation set was processed through both MULTICASE (PCA2i) and DEREK (version 3.4) to predict mutagens and non-mutagens. 247 compounds in the validation set were found not to be in the MULTICASE training set. On the other hand, DEREK is not predicated directly on training compounds, but uses rules derived from the absence or presence of structural alerts. For all 400 validation compounds used in this study, DEREK gave 76% for concordance with 17% false positives and 7% false negatives, compared with the corresponding averages of 83, 6 and 11% for the three QSAR models (without threshold criteria being applied). Sensitivity (percent fraction of correctly predicted mutagens) and specificity (percent fraction of correctly predicted non-mutagens) for DEREK were 87 and 53%, compared with the corresponding averages of 84 and 83% for the three QSAR models. These QSAR models appear not to appreciably over-predict false positives as much as do the structural alerts while, at the same time, the QSAR models appear to slightly over-predict false negatives compared with DEREK; 7.5% false negatives (DEREK) versus an average of 11% for the three QSAR models. A comparison was made of all models (the three QSARs, MULTICASE and DEREK) on 247 validation compounds; again, no reliability threshold was applied to any model. Figure 6A presents results as percentages for all models for the 217 non-drugs (175 mutagens, 42 non-mutagens) within the 247 set. An insert in the figure provides actual numbers of compounds showing concordance, as well as numbers of false positives and negatives. Figure 6B shows similar results for the 30 drugs within the 247 compound validation set. For the 217 non-drug dataset ANN, kNN and DF gave a slightly improved average concordance (87%) than the average (82%) for the fragment and rule-based models. Sensitivities were 85, 90 and 87% for the ANN, kNN and DF models, respectively, compared with 91 and 86% for MULTICASE and DEREK, respectively. Fig. 6. Open in new tabDownload slide (A) Comparison of the three QSAR models discussed in this paper with two commercial models, MULTICASE and DEREK. The statistical comparisons are calculated on 217 validation (175 mutagens, 42 non-mutagens), non-drug NECs not present as training compounds for any model. The top bars are percent correct (concordance), with the lower bars listing percent false positives and negatives found among the five models. (Inset) The resultant number of compounds showing concordance and incorrect predictions for mutagens and non-mutagens. (B) Comparison of correct and incorrect predictions for 30 therapeutic drugs contained within 247 compounds found to be NCEs for all five models. (Inset) The resultant number of compounds showing concordance and incorrect predictions for mutagens and non-mutagens. These NECs contained 29 non-mutagens and 1 mutagen. Fig. 6. Open in new tabDownload slide (A) Comparison of the three QSAR models discussed in this paper with two commercial models, MULTICASE and DEREK. The statistical comparisons are calculated on 217 validation (175 mutagens, 42 non-mutagens), non-drug NECs not present as training compounds for any model. The top bars are percent correct (concordance), with the lower bars listing percent false positives and negatives found among the five models. (Inset) The resultant number of compounds showing concordance and incorrect predictions for mutagens and non-mutagens. (B) Comparison of correct and incorrect predictions for 30 therapeutic drugs contained within 247 compounds found to be NCEs for all five models. (Inset) The resultant number of compounds showing concordance and incorrect predictions for mutagens and non-mutagens. These NECs contained 29 non-mutagens and 1 mutagen. A large difference in performance occurred for specificity in dealing with the non-drugs; ANN performed best at 86%, with DF next at 83%, followed by the kNN model at 79%, as opposed to 50% for MULTICASE and 62% for DEREK. Additional large differences were also found for the 30 drugs in this subset of validation compounds, as shown in Figure 6B. ANN produced the highest concordance (93%), while MULTICASE and DEREK produced only 60 and 57% concordance, respectively, due to a significant tendency to over-predict false positives (40 and 43%, respectively). Unfortunately, nothing can be said about sensitivity with this small drug subset since there was only one mutagen among the 30 drugs. All models predicted this one mutagen correctly. In comparison with rule or fragment-based models, the ANN, kNN and DF models average 50% better in terms of concordance for therapeutic drugs when compared with the average from MULTICASE and DEREK. As with non-drugs, a much higher specificity was obtained with the three QSAR models (ANN, 93%; kNN, 83%; DF, 86%) as compared with 59 and 55% for MULTICASE and DEREK, respectively. The results reported here on 217 non-drugs are in reasonable agreement with a reported validation study (Pearl, 2001) using a 55:45 ratio of Ames mutagens to non-mutagens among 516 non-drug compounds. MULTICASE and DEREK gave concordances of 81 and 70% accompanied by moderate to low specificities (73 and 46%) and similar high sensitivities (87 and 89%, respectively) as found for 217 non-drugs studied here. In the same report (Pearl et al., 2001), results for 123 drugs using MULTICASE and DEREK gave modest results; 72 and 61% concordances with moderate to low specificities (59 and 49%, respectively). These concordances are somewhat higher than those found for the 30 drugs used in the present validation study but the specificities are consistent with those reported here. Structures and tabulated results for the 400 validation compounds for all five models (three QSARS, MULTICASE and DEREK) can be found at www.mutage.oupjournals/org or www.chemsilico/supplemental_information. Conclusions The important factors in the development of robust predictive models for Ames mutagenicity, as found in this study, include the use of consensus models, structure description based on topological descriptors and a large, diverse set of training compounds. The nature of the models developed by the three methods in this study allows the potential for identification of less reliable predictions; use of reliability criteria yields excellent concordance values (≥91%) with improved sensitivity and specificity after such outcomes have been removed in validation testing. We conclude that this result is a key attribute of the ANN, kNN and DF consensus models. Unavoidable noisy end-points within a dataset certainly contribute negatively to the predictability of a QSAR model. It is reported that S.typhimurium mutagenicity testing performed by experienced laboratories using standardized protocols leads to a reproducibility of between 84 and 87% (Piegorisch and Zeiger, 1991). Consensus QSAR modeling, as reported here, diminishes the effects of noisy data in that all individual models within the main predictor contain varying amounts of noisy data as opposed to just the development of a single non-consensus model, hence, this averaging of noisy data permits more reliable predictions with consensus modeling than without. Although in this study use of reliability thresholds in different QSARs targeted anywhere from 20 to 40% of the 400 compounds in the validation set, it is our opinion that it is much better to identify compounds which may possibly be less reliable predictions rather than not. This conclusion is reinforced in the light of high sensitivities and specificities in the 90% range that are achieved with statistically more reliable outcomes. The three models developed here, ANN, kNN and DF have approximately the same level of robustness for the complete validation set of compounds, with larger differences with regard to therapeutic drugs. The ANN model shows somewhat better performance than either the kNN or DF model for this class of compounds. DF performs equally well as the other QSAR models (ANN and kNN) in overall validation results without application of its reliability criterion. However, use of the DF reliability criterion appears to cause both over- and under-prediction and leads to the removal of a large number of predictions deemed less reliable with both non-drugs and therapeutic ones. Use of topological descriptors and use of training or learning algorithms for these three QSAR models showed substantially better concordances with drugs as well as much higher specificity when applied to both non-drugs and therapeutic entities compared with MULTICASE and DEREK. However, all models have comparable sensitivities without implementing threshold criteria for the former three models. The modeling approach described in this study has also produced very strong QSAR models for aqueous solubility and human intestinal absorption (Lobell and Sivarajah, 2004; Votano et al., 2004). In our overall methodology heavy emphasis is placed on the use of representative external validation test sets. The strength of these models arises from a combination of high quality topological structure information representation, well-developed nonlinear modeling methods and data sets that have been carefully assembled and screened for anomalies. The appreciable degree of commonality among the most important descriptors selected by the three modeling methods is also of interest, especially in the light of the diverse methods [training (DF) versus learning (ANN and kNN)] and modeling algorithms used in this study. The topological structure representation modeling process produces a set of descriptors that shows a significant relation to genotoxicity, in a statistical sense. Additional examination of the nature of the topological descriptors reveals significant structural information, including a relationship to known structure alerts. Topological structure representation (molecular connectivity χ indices, electrotopological state descriptors and κ shape indices) produces a set of descriptors that encodes information about both molecular structure fragments and their molecular context. Based on this work, topological descriptors encode significant information that, in conjunction with large datasets and nonlinear modeling algorithms, leads to the high quality models presented here. Supplementary material Supplementary material can be found at: http://www.mutage.oupjournals.org. Conflict of interest statement
 J.R.Votano, M.Parham, L.H.Hall and L.B.Kier are partners in ChemSilico LLC. One of the models described in this paper was developed by ChemSilico LLC. We would like to thank Dr Constantine Kreatsoulas for performing the Ames genotoxicity predictions using several commercial predictors. References Accelrys Inc. ( 2003 ) TOPKAT software. Accelrys Inc., San Diego, CA. Adamson,G.W. and Bawden,D. ( 1976 ) An empirical method of structure activity correlation for polysubstituted cyclic compounds using Wiswesser line notation. J. Chem. Inf. Comput. Sci. , 16 , 161 –169. Adamson,G.W. and Bush,J.A. ( 1976 ) Evaluation of an empirical structure–activity relationship for property prediction in a structurally diverse group of local anesthetics. J. Chem. Soc. Perkin Trans. , 1 , 168 –179. Analyse-It Software Ltd ( 2003 ) Analyse-It. Analyse-It Software Ltd, Leeds, UK. Ashby,J. and Tennant,R.W. ( 1988 ) Chemical-structure, Salmonella mutagenicity and extent of carcinogenicity as indicators of genotoxic carcinogenesis, among 222 chemicals tested in rodents by the United-States NCI/NTP. Mutat. Res. , 204 , 17 –115. Ashby,J. and Tennant,R.W. ( 1991 ) Definitive relationships among chemical structure, carcinogenicity and mutagenicity for 301 chemicals tested by the U.S. NTP. Mutat. Res. , 257 , 229 –306. Ashby,J., Tennant R.W., Zeiger,E. and Stasiewiz,S. ( 1989 ) Classification according to chemical structure, mutagenicity to Salmonella and level of carcinogenicity of a further 42 chemicals tested for carcinogenicity by the U.S. National Toxicology Program. Mutat. Res. , 223 , 73 –103. Bacha,P.A., Gruver,H.S., Den Hartog,B.K., Tamura,S.Y. and Nutt,R.F. ( 2002 ) Rule extraction from a mutagenicity data set using adaptively grown phylogenetic-like trees. J. Chem. Inf. Comput. Sci. , 42 , 1104 –1111. Baska,S.C., Mills,D.R., Balaban,A.T. and Gute,B.D. ( 2001 ) Prediction of mutagenicity of aromatic and heteroaromatic amines from structure: a hierarchical approach. J. Chem. Inf. Comput. Sci. , 41 , 671 –678. Devillers,J. ( 1996 ) Strengths and weaknesses of the back propagation neural network in QSAR and QSPR studies. In Devillers,J. (ed.) Genetic Algorithms in Molecular Modeling. Academic Press, San Diego, CA, pp. 1–24. Golbraikh,A. and Tropsha,A. ( 2002 ) Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection. J. Comput. Aided Mol. Des. , 16 , 357 –369. Gold,L.S. and Zeiger,E. ( 1996 ) Handbook of Carcinogenic Potency and Genotoxicity Databases. CRC Press, Boca Raton, FL. Gough,J. and Hall,L.H. ( 1999 ) Modeling the toxicity of amide herbicides using the electrotopological state. J. Environ. Toxicol. Chem. , 18 , 1069 –1075. Hall,L.H. ( 2004 ) A structure-information approach to prediction of biological activities and properties. Chem. Biodiversity , in press. Hall,L.H. and Kier,L.B. ( 1995 ) Electrotopological state indices for atom types: a novel combination of electronic, topological and valence state information. J. Chem. Inf. Comput. Sci. , 35 , 1039 –1045. Hall,L.H. and Vaughn,T.A. ( 1997 ) QSAR of phenol toxicity using electrotopological state and kappa shape indices. Med. Chem. Res. , 7 , 407 –416. Hall,L.H., Maynard,E.L. and Kier,L.B. ( 1989 ) Structure–activity relationship studies on the toxicity of benzene derivatives: III. Predictions and extension to new substituents. J. Environ. Toxicol. Chem. , 8 , 431 –436. Hall,L.H., Maynard,E.L. and Kier,L.B. ( 1989 ) QSAR investigation of benzene toxicity to fathead minnows using molecular connectivity. J. Environ. Toxicol. Chem. , 8 , 783 –788. Hopfinger,A.J. and Patel,H.C. ( 1996 ) Application of genetic algorithms to the general QSAR problem and to guiding—molecular diversity experiments. In Devillers,J. (ed.) Genetic Algorithms in Molecular Modeling. Academic Press, San Diego, CA, pp. 131–154. Kier,L.B. and Hall,L.H. ( 1999 ) Molecular Structure Description: The Electrotopological State. Academic Press, San Diego, CA. King,R.D., Muggleton,S.H., Srinivasan,A. and Sternberg,M.J. ( 1996 ) Structure–activity relationships derived by machine language learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming, Proc. Natl Acad. Sci. USA , 93 , 438 –442. Kirkpatrick,S., Gelatt,C.D.,Jr and Vecchi,M.P. ( 1983 ) Optimization by simulated annealing. Science , 220 , 671 –680. Klopman,G., Frierson,M.R. and Rosenkranz,H.S. ( 1984 ) Structural requirements for mutagenicity of environmental nitroarenes. Mutat. Res. , 126 , 277 –238. Klopman,G., Frierson,M.R. and Rosenkranz,H.S. ( 1990 ) The structural basis of the mutagenicity of chemicals in Salmonella typhimurium: the Geno-Tox Data Base. Mutat. Res. , 228 , 1 –50. Kubo,T., Urano,K. and Utsumi,H. ( 2002 ) Mutagenicity characteristics of 255 environmental chemicals. J. Health Sci. , 48 , 545 –554. Lobell,M. and Sivarajah,V. ( 2004 ) In silico prediction of aqueous solubility, human plasma protein binding and volume distribution from calculated pKa and AlogP98 values. J. Mol. Diversity , 7 , 69 –87. Mattioni,B.E., Kauffman,G.W. and Jurs,P.C. ( 2003 ) Predicting the genotoxicity of secondary and aromatic amines using data subsetting to generate a model ensemble. J. Chem. Inf. Comput. Sci. , 43 , 949 –963. Miller,A. ( 2002 ) Subset Selection in Regression, 2nd Edn. Chapman and Hall/CRC Press, Boca Raton, FL. Miller,J.A. and Miller,E.C. ( 1977 ) Ultimate chemical carcinogen as reactive mutagenic electrophiles. In Hiatt,H.H., Watson,J.D. and Winsten,J.A. (eds) Origin of Human Cancers. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY, pp. 605–627. Pearl,G.M., Livingston-Carr,S. and Durham,S.K. ( 2001 ) Integration of computational analysis as a sentinel tool in toxicological assessments, Curr. Topics Med. Chem. , 1 , 247 –255. Piegorisch,W.W. and Zeiger,E. ( 1991 ) Measuring intra-assay agreement for Ames Salmonella assay. In Hotorn,L. (ed.) Lecture Notes in Informatics, Vol. 43, Statistical Methods in Toxicology. Springer, Heidelberg, pp. 35–41. Rose,K. and Hall,L.H. ( 2003 ) E-state modeling of fish toxicity independent of 3D structure information, SAR QSAR. Environ. Res. , 14 , 113 –129. Sanderson,D.M. and Earnshaw,C.G. ( 1991 ) Computer prediction of possible toxic action from chemical structure; the DEREK system. Hum. Exp. Toxicol. , 10 , 261 –273. Snyder,R.D. and Green,J.W. ( 2001 ) A review of the genotoxicity of marked pharmaceuticals. Mutat. Res. , 488 , 151 –159. Sutter,J.M., Dixon,S.L. and Jurs,P.C. ( 1995 ) Automated descriptor selection for quantitative structure–activity relationships using generalized simulated annealing. J. Chem. Inf. Comput. Sci. , 35 , 77 –84. Tennant,R.W. and Ashby,J. ( 1991 ) Classification according to chemical structure, mutagenicity to Salmonella and level of carcinogenicity of a further 39 chemicals tested for carcinogenicity by the U.S. National Toxicology Program. Mutat. Res. , 257 , 209 –227. Tong,W., Hong,H., Fang,H., Xie,Q. and Perkins,R. ( 2003 ) Decision Forest: combining the predictions of multiple independent decision tree models. J. Chem. Inf. Comput. Sci. , 43 , 525 –531. Votano,J.R., Parham,M., Hall,L.H., Kier,L.B. and Hall,L.M. ( 2004 ) Prediction of aqueous solubility based on large datasets using several QSPR models utilizing topological structure representation. Chem. Biodiversity , in press. Zheng,W. and Tropsha,A. ( 2000 ) Novel variable selection quantitative structure–property relationship approach based on the k-nearest-neighbor principle. J. Chem. Inf. Comput. Sci. , 40 , 185 –194. Author notes ChemSilico LLC, 48 Baldwin Street, Tewksbury, MA 01876, USA, 1Department of Chemistry, Eastern Nazarene College, Quincy, MA 02170, USA, 2Department of Medicinal Chemistry, School of Pharmacy, Virginia Commonwealth University, Richmond, VA 23298, USA, 3Laboratory for Molecular Modeling, Division of Medicinal Chemistry and Natural Products, School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599, USA and 4Center for Toxicoinformatics, Division of Biometry and Risk Assessment, National Center for Toxicological Research, Jefferson, AK 72079, USA Mutagenesis vol. 19 no. 5 © UK Environmental Mutagen Society 2004; all rights reserved. TI - Three new consensus QSAR models for the prediction of Ames genotoxicity JO - Mutagenesis DO - 10.1093/mutage/geh043 DA - 2004-09-01 UR - https://www.deepdyve.com/lp/oxford-university-press/three-new-consensus-qsar-models-for-the-prediction-of-ames-spvMYoAVzb SP - 365 EP - 377 VL - 19 IS - 5 DP - DeepDyve ER -