Motivation: The comprehensive information on small molecules and their biological activities in PubChem brings great opportunities for academic researchers. However, mining high-throughput screening (HTS) assay data remains a great challenge, given the very large data volume and the highly imbalanced nature of the data, with only a small number of active compounds compared to inactive compounds. There is therefore a need for better strategies to work with HTS assay data. Moreover, because luciferase-based HTS technology is frequently used in the assays deposited in PubChem, constructing a computational model to distinguish and filter out potential interference compounds for these assays is another motivation.

Results: We used the granular support vector machine (SVM) repetitive undersampling method (GSVM-RU) to construct an SVM from luciferase inhibition bioassay data in which the ratio of active to inactive compounds is highly imbalanced (1/377). The best model recognized the active and inactive compounds with accuracies of 86.60% and 88.89%, respectively, and a total accuracy of 87.74%, in cross-validation and blind tests. These results demonstrate the robustness of the model in handling the intrinsic imbalance problem in HTS data, and the model can be used as a virtual screening tool to identify potential interference compounds in luciferase-based HTS experiments. The method also proved computationally efficient, greatly reducing the computational cost, and can be easily adopted in the analysis of HTS data for other biological systems.

Availability: Data are publicly available in PubChem with AIDs of 773, 1006 and 1379.

Contact: ywang@ncbi.nlm.nih.gov; bryant@ncbi.nlm.nih.gov

Supplementary information: Supplementary data are available at Bioinformatics online.
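To illustrate why the abstract reports separate accuracies for active and inactive compounds, the following minimal sketch builds toy labels at the stated 1/377 ratio and scores a degenerate model that calls every compound inactive. It also shows plain random undersampling of the inactive class into one balanced subset. This is not the GSVM-RU procedure itself, which is not specified in the abstract. The function names, the toy data, and the 1 = active / 0 = inactive label convention are assumptions made for illustration only.

```python
import random


def per_class_accuracy(y_true, y_pred):
    """Return (active_accuracy, inactive_accuracy).

    Assumed label convention: 1 = active, 0 = inactive.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg


def undersample(labels, rng):
    """Return indices of all actives plus an equal-sized random draw of inactives.

    Plain random undersampling, shown only to illustrate the idea of
    shrinking the majority class; it is not the granular variant.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return pos + rng.sample(neg, len(pos))


# Toy labels at the 1/377 ratio from the abstract: 10 actives, 3770 inactives.
labels = [1] * 10 + [0] * 3770

# A degenerate model that predicts "inactive" for every compound.
all_inactive = [0] * len(labels)
overall = sum(t == p for t, p in zip(labels, all_inactive)) / len(labels)
active_acc, inactive_acc = per_class_accuracy(labels, all_inactive)

# One balanced training subset: 10 actives and 10 randomly chosen inactives.
subset = undersample(labels, random.Random(0))

print(f"overall accuracy:  {overall:.4f}")
print(f"active accuracy:   {active_acc:.4f}")
print(f"inactive accuracy: {inactive_acc:.4f}")
print(f"balanced subset size: {len(subset)}")
```

At this ratio the degenerate model scores above 99% overall while finding no active compounds, which is why per-class accuracies are the more informative measure on imbalanced HTS data.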
Bioinformatics – Oxford University Press
Published: Oct 13, 2009