Enhancing Data Analysis with Noise Removal
March 2006 (vol. 18 no. 3)
pp. 304-319
Hui Xiong, IEEE
Michael Steinbach, IEEE Computer Society
Removing objects that are noise is an important goal of data cleaning as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the product of low-level data errors that result from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered as noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain large amounts of noise, these techniques also need to be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels. Three of these methods are based on traditional outlier detection techniques: distance-based, clustering-based, and an approach based on the Local Outlier Factor (LOF) of an object. The other technique, which is a new method that we are proposing, is a hyperclique-based data cleaner (HCleaner). These techniques are evaluated in terms of their impact on the subsequent data analysis, specifically, clustering and association analysis. Our experimental results show that all of these methods can provide better clustering performance and higher quality association patterns as the amount of noise being removed increases, although HCleaner generally leads to better clustering performance and higher quality associations than the other three methods for binary data.
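As a rough illustration of the distance-based idea mentioned in the abstract, the sketch below scores each object by the distance to its k-th nearest neighbor and discards a chosen fraction of the highest-scoring objects. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the use of Euclidean distance, the k-th-nearest-neighbor score, and the parameters `k` and `fraction` are all illustrative choices.

```python
import math


def knn_distance_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    Larger scores mean the point lies farther from its neighbors.
    (Hypothetical helper; the scoring rule is an assumption.)
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def remove_noise(points, k=2, fraction=0.2):
    """Return the points left after dropping the given fraction with the largest scores."""
    scores = knn_distance_scores(points, k)
    n_remove = int(len(points) * fraction)
    # Indices ordered from most to least outlying.
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    drop = set(ranked[:n_remove])
    return [p for i, p in enumerate(points) if i not in drop]


data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
cleaned = remove_noise(data, k=2, fraction=0.2)
```

Treating the removed fraction as an explicit parameter reflects the abstract's point that a cleaner may need to discard a large share of the data; the brute-force neighbor search here is quadratic and only suitable for small examples.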

Index Terms:
Data cleaning, very noisy data, hyperclique pattern discovery, local outlier factor (LOF), noise removal.
Citation:
Hui Xiong, Gaurav Pandey, Michael Steinbach, Vipin Kumar, "Enhancing Data Analysis with Noise Removal," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 3, pp. 304-319, March 2006, doi:10.1109/TKDE.2006.46