Building a classification model on imbalanced datasets can be a challenging endeavor. Models built on data where examples of one class are greatly outnumbered by examples of the other class(es) tend to sacrifice accuracy with respect to the underrepresented class in favor of maximizing the overall classification rate. Several methods have been suggested to alleviate the problem of class imbalance. One common technique that has received much attention in recent research is data sampling. Data sampling either adds examples to the minority class (oversampling) or removes examples from the majority class (undersampling) in order to create a more balanced data set. Both oversampling and undersampling have their strengths and drawbacks. In this work we propose a hybrid sampling procedure that uses a combination of two sampling techniques to create a balanced data set. By using more than one sampling technique, we can combine the strengths of the individual techniques while lessening the drawbacks. We perform a comprehensive set of experiments, with more than one million classifiers built, showing that our hybrid sampling procedure almost always outperforms the individual sampling techniques.
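The abstract describes balancing a data set by combining oversampling of the minority class with undersampling of the majority class. As a minimal sketch of that idea, the snippet below meets in the middle: it randomly subsamples the majority class and randomly oversamples the minority class (with replacement) until both reach a common target size. This is illustrative only; the paper's actual procedure pairs specific sampling techniques, which the abstract does not name, and the function, parameters, and midpoint rule here are assumptions.

```python
import random

def hybrid_sample(majority, minority, target_size=None, seed=0):
    """Balance two classes with a simple hybrid of under- and oversampling.

    Illustrative sketch only: real hybrid procedures may combine more
    sophisticated techniques (e.g. synthetic minority oversampling with
    informed undersampling). Here, both classes are brought to a common
    target size, defaulting to the midpoint of the two class sizes.
    """
    rng = random.Random(seed)
    if target_size is None:
        # Meet in the middle: shrink the majority, grow the minority.
        target_size = (len(majority) + len(minority)) // 2
    # Undersample: draw a random subset of the majority class.
    sampled_majority = rng.sample(majority, target_size)
    # Oversample: duplicate random minority examples until the target is met.
    sampled_minority = minority + [rng.choice(minority)
                                   for _ in range(target_size - len(minority))]
    return sampled_majority, sampled_minority

# Example: 90 majority vs. 10 minority examples become 50 vs. 50.
maj = [("maj", i) for i in range(90)]
mino = [("min", i) for i in range(10)]
bal_maj, bal_min = hybrid_sample(maj, mino)
print(len(bal_maj), len(bal_min))  # 50 50
```

A classifier would then be trained on the union of the two balanced samples; because the majority class is only partially discarded and the minority class only partially duplicated, the hybrid tempers both the information loss of pure undersampling and the overfitting risk of pure oversampling.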
Integrated Computer-Aided Engineering – IOS Press
Published: Jan 1, 2009