TY - JOUR
AU - Rathore, Santosh S.
AB - The use of regression methods, for instance, linear regression, decision tree regression, etc., has been used earlier to build software fault prediction (SFP) models. However, these methods showed limited SFP performance with higher misclassification errors. In previous works, issues such as multicollinearity, feature scaling, and imbalance distribution of faulty and non-faulty modules in the dataset have not been considered reasonably, which might be a potential cause behind the poor prediction performance of these regression methods. Motivated from it, in this paper, we investigate the impact of 15 different regression methods for the faults count prediction in the software system and report their interpretation for fault models. We consider different fault data quality issues, and a comprehensive assessment of the regression methods is presented to handle these issues. We believe that many used regression methods have not been explored before for the SFP by considering different data quality issues. In the presented study, 44 fault datasets and their versions are used that are collected from the PROMISE software data repository are used to validate the performance of the regression methods, and absolute relative error (ARE), root mean square error (RSME), and fault-percentile-average (FPA) are used as the performance measures. For the model building, five different scenarios are considered, (1) original dataset without preprocessing; (2) standardized processed dataset; (3) balanced dataset; (4) non-multicollinearity processed dataset; (5) balanced+non-multicollinearity processed dataset. Experimental results showed that overall kernel-based regression methods, KernelRidge and SVR (Support vector regression, both linear and nonlinear kernels), yielded the best performance for predicting the fault counts compared to other methods. Other regression methods, in particular NNR (Nearest neighbor regression), RFR (Random forest regression), and GBR (Gradient boosting regression), are performed significantly accurately. Further, results showed that applying standardization and handling multicollinearity in the fault dataset helped improve regression methods’ performance. It is concluded that regression methods are promising for building software fault prediction models.
TI - An exploratory analysis of regression methods for predicting faults in software systems
JO - Soft Computing
DO - 10.1007/s00500-021-06048-x
DA - 2021-12-01
UR - https://www.deepdyve.com/lp/springer-journals/an-exploratory-analysis-of-regression-methods-for-predicting-faults-in-mN0lldcLUJ
SP - 14841
EP - 14872
VL - 25
IS - 23
DP - DeepDyve
ER -