16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'04)
Noise Identification with the k-Means Algorithm
Boca Raton, Florida
November 15-November 17
ISBN: 0-7695-2236-X
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICTAI.2004.93
The presence of noise in a measurement dataset can degrade the classification model built from it. More specifically, the noisy instances in the dataset can adversely affect the learnt hypothesis. Removing noisy instances improves the learnt hypothesis and thus the classification accuracy of the model. A clustering-based noise detection approach using the k-means algorithm is presented. We present a new metric, the noise factor, for measuring how likely an instance is to be noisy. Based on the computed noise factor values of the instances, the clustering-based algorithm is then used to identify and eliminate p% of the instances in the dataset, namely those considered most likely to be noisy; the value of p is varied from 1% to 40%. The noise detection approach is investigated with respect to two case studies of software measurement data obtained from NASA software projects. The two datasets are characterized by the same thirteen software metrics and a class label that classifies each program module as either fault-prone or not fault-prone. It is shown that as more noisy instances are removed, the classification accuracy of the C4.5 learner improves. This indicates that the removed instances are most likely noisy instances that contributed to poor classification accuracy.
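The abstract does not define the noise factor, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes, as a stand-in, that an instance's noise factor is its Euclidean distance to the centroid of the k-means cluster it falls in, and it removes the p% of instances with the largest values. The function names (`kmeans`, `noise_factors`, `remove_most_noisy`) and the deterministic farthest-first initialization are hypothetical choices made for this example.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's k-means with a deterministic farthest-first start.

    Returns (centroids, assignments). The initialization is an assumption
    made for reproducibility, not something taken from the paper.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from every centroid chosen so far.
        nxt = max(points, key=lambda q: min(math.dist(q, c) for c in centroids))
        centroids.append(list(nxt))

    assignments = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no assignment changed
        assignments = new_assignments
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def noise_factors(points, k):
    """ASSUMED noise factor: distance from each point to its own centroid."""
    centroids, assignments = kmeans(points, k)
    return [math.dist(p, centroids[a]) for p, a in zip(points, assignments)]


def remove_most_noisy(points, labels, p, k=2):
    """Drop the p% of instances with the highest (assumed) noise factor.

    Returns (kept_points, kept_labels, dropped_indices).
    """
    factors = noise_factors(points, k)
    n_remove = int(len(points) * p / 100)
    ranked = sorted(range(len(points)), key=lambda i: factors[i], reverse=True)
    dropped = set(ranked[:n_remove])
    kept = [i for i in range(len(points)) if i not in dropped]
    return [points[i] for i in kept], [labels[i] for i in kept], sorted(dropped)
```

In an experiment shaped like the one described, `remove_most_noisy` would be called with p ranging from 1 to 40, and a classifier would be retrained on the kept instances each time to see whether accuracy improves.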
Index Terms:
data noise, outliers, software quality estimation, software metrics, k-means, clustering
Citation:
Wei Tang, Taghi M. Khoshgoftaar, "Noise Identification with the k-Means Algorithm," 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'04), pp. 373-378, 2004.
