16th IEEE International Conference on Tools with Artificial Intelligence (2004)
Boca Raton, Florida
Nov. 15, 2004 to Nov. 17, 2004
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICTAI.2004.93
Wei Tang, Florida Atlantic University
Taghi M. Khoshgoftaar, Florida Atlantic University
The presence of noise in a measurement dataset can have a negative effect on the classification model built from it. More specifically, the noisy instances in the dataset can adversely affect the learnt hypothesis. Removing noisy instances improves the learnt hypothesis and thus the classification accuracy of the model. A clustering-based noise detection approach using the k-means algorithm is presented. We present a new metric, the noise factor, that measures how likely an instance is to be noisy. Based on the computed noise factor values of the instances, the clustering-based algorithm is then used to identify and eliminate p% of the instances in the dataset. These p% of instances are considered the most likely to be noisy among the instances in the dataset; the value of p is varied from 1% to 40%. The noise detection approach is investigated with respect to two case studies of software measurement data obtained from NASA software projects. The two datasets are characterized by the same thirteen software metrics and a class label that classifies the program modules as fault-prone or not fault-prone. It is shown that as more noisy instances are removed, the classification accuracy of the C4.5 learner improves. This indicates that the removed instances are most likely noisy instances that contributed to poor classification accuracy.
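The general shape of such a procedure can be illustrated with a minimal sketch. The abstract does not define the noise factor, so the code below assumes a stand-in score: the Euclidean distance of each instance to the centroid of its own k-means cluster. The function names `kmeans` and `remove_noisiest`, the parameters `k` and `seed`, and the rounding rule for p% are all hypothetical choices made for illustration and are not taken from the paper.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of numeric tuples.

    Returns (centroids, labels). Initial centroids are k distinct
    instances sampled with a fixed seed for repeatability.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each instance joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                new_centroids.append(mean)
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


def remove_noisiest(points, k, p, seed=0):
    """Flag the p% of instances with the highest assumed noise score.

    ASSUMPTION: the score is the distance to the instance's own cluster
    centroid. This is a stand-in, not the paper's noise factor.
    Returns (kept_indices, removed_indices).
    """
    centroids, labels = kmeans(points, k, seed=seed)
    scores = [math.dist(pt, centroids[lab]) for pt, lab in zip(points, labels)]
    n_remove = round(len(points) * p / 100)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    removed = sorted(ranked[:n_remove])
    removed_set = set(removed)
    kept = [i for i in range(len(points)) if i not in removed_set]
    return kept, removed
```

Calling `remove_noisiest(points, k, p)` for increasing values of p, and retraining a classifier on the kept instances each time, would mirror the kind of experiment the abstract describes, although the classifier, the choice of k, and the actual scoring metric would need to follow the paper itself.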
data noise, outliers, software quality estimation, software metrics, k-means, clustering
T. M. Khoshgoftaar and W. Tang, "Noise Identification with the k-Means Algorithm," 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Boca Raton, Florida, 2004, pp. 373-378.