A novel method for mining highly imbalanced high-throughput screening data in PubChem

Bioinformatics, Volume 25 (24): 7 – Oct 13, 2009
7 pages

Publisher
Oxford University Press
Copyright
© The Author(s) 2009. Published by Oxford University Press.
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btp589
pmid
19825798

Abstract

Motivation: The comprehensive information on small molecules and their biological activities in PubChem brings great opportunities for academic researchers. However, mining high-throughput screening (HTS) assay data remains a great challenge given the very large data volume and the highly imbalanced nature of the data, with only a small number of active compounds compared to inactive compounds. Therefore, there is currently a need for better strategies to work with HTS assay data. Moreover, as luciferase-based HTS technology is frequently exploited in the assays deposited in PubChem, constructing a computational model to distinguish and filter out potential interference compounds for these assays is another motivation.

Results: We used the granular support vector machines (SVMs) repetitive under-sampling method (GSVM-RU) to construct an SVM from luciferase inhibition bioassay data in which the imbalance ratio of active/inactive compounds is high (1/377). The best model recognized the active and inactive compounds with accuracies of 86.60% and 88.89%, respectively, and a total accuracy of 87.74%, by cross-validation test and blind test. These results demonstrate the robustness of the model in handling the intrinsic imbalance problem in HTS data, and the model can be used as a virtual screening tool to identify potential interference compounds in luciferase-based HTS experiments. Additionally, the method proved computationally efficient, greatly reducing the computational cost, and can be easily adopted in the analysis of HTS data for other biological systems.

Availability: Data are publicly available in PubChem with AIDs of 773, 1006 and 1379.

Contact: ywang@ncbi.nlm.nih.gov; bryant@ncbi.nlm.nih.gov

Supplementary information: Supplementary data are available at Bioinformatics online.
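To illustrate why the abstract reports separate accuracies for active and inactive compounds, the following minimal sketch computes per-class and overall accuracy on synthetic labels at the stated 1/377 ratio. The function name `per_class_accuracy` and the toy labels are assumptions made for illustration; they are not taken from the paper, and the sketch does not implement GSVM-RU itself.

```python
def per_class_accuracy(y_true, y_pred):
    """Return (active_acc, inactive_acc, overall_acc).

    Labels are assumed to be 1 for active and 0 for inactive compounds.
    """
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    n_active = sum(1 for t in y_true if t == 1)
    n_inactive = len(y_true) - n_active
    return (
        true_pos / n_active,
        true_neg / n_inactive,
        (true_pos + true_neg) / len(y_true),
    )


# Synthetic data at a 1/377 active/inactive ratio (illustrative only).
y_true = [1] + [0] * 377

# A trivial classifier that labels every compound as inactive.
y_pred = [0] * len(y_true)

active_acc, inactive_acc, overall_acc = per_class_accuracy(y_true, y_pred)
print(f"active: {active_acc:.2%}, inactive: {inactive_acc:.2%}, "
      f"overall: {overall_acc:.2%}")
```

On data this skewed, a classifier that never predicts "active" still scores a very high overall accuracy while finding no active compounds, which is why per-class figures such as those quoted in the Results are the more informative measure.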

Journal

Bioinformatics, Oxford University Press

Published: Oct 13, 2009
