Toward Unsupervised Correlation Preserving Discretization
September 2005 (vol. 17 no. 9)
pp. 1174-1185
Discretization is a crucial preprocessing technique used for a variety of data warehousing and mining tasks. In this paper, we present a novel PCA-based unsupervised algorithm for the discretization of continuous attributes in multivariate data sets. The algorithm leverages the underlying correlation structure in the data set to obtain the discrete intervals and ensures that the inherent correlations are preserved. Previous efforts on this problem are largely supervised and consider only piecewise correlation among attributes. We consider the correlation among continuous attributes and, at the same time, also take into account the interactions between continuous and categorical attributes. Our approach also extends easily to data sets containing missing values. We demonstrate the efficacy of the approach on real data sets and as a preprocessing step for both classification and frequent itemset mining tasks. We show that the intervals are meaningful and can uncover hidden patterns in data. We also show that large compression factors can be obtained on the discretized data sets. The approach is task independent, i.e., the same discretized data set can be used for different data mining tasks. Thus, the data sets can be discretized, compressed, and stored once and can be used again and again.
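The abstract describes the approach only at a high level, so the following is a rough sketch of the general idea rather than the authors' algorithm. It assumes a much simplified scheme: standardize the continuous attributes, take the first principal component of their correlation matrix, place equal-frequency cut points along that component, and project those cut points back onto each original attribute. The function names `pca_guided_cuts` and `discretize`, the equal-frequency rule, and the use of a single component are all illustrative assumptions. Categorical attributes and missing values are not handled here.

```python
import numpy as np

def pca_guided_cuts(X, n_bins=3):
    """Hypothetical helper: cut points per attribute, derived from the
    first principal component (assumes at least two attributes)."""
    X = np.asarray(X, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0)
    std = np.where(std == 0, 1.0, std)        # guard against constant columns
    Z = (X - mean) / std                      # standardized data

    # Eigen-decomposition of the correlation structure of the data.
    corr = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    v = eigvecs[:, np.argmax(eigvals)]        # dominant eigenvector

    # Assumption: equal-frequency cut points along the component.
    scores = Z @ v
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    score_cuts = np.quantile(scores, quantiles)

    # Project each cut point on the component axis back to original units.
    cuts = np.outer(v, score_cuts) * std[:, None] + mean[:, None]
    return np.sort(cuts, axis=1)              # shape: (n_attributes, n_bins - 1)

def discretize(X, cuts):
    """Map each continuous value to the index of its interval."""
    X = np.asarray(X, dtype=float)
    return np.column_stack(
        [np.digitize(X[:, j], cuts[j]) for j in range(X.shape[1])]
    )

# Usage on synthetic, strongly correlated data.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([
    base + 0.1 * rng.normal(size=500),
    2 * base + 0.2 * rng.normal(size=500) + 5,
])
cuts = pca_guided_cuts(X, n_bins=3)
D = discretize(X, cuts)
```

Because the cut points for all attributes come from the same positions along the shared component, correlated attributes tend to change interval together, which is the intuition behind correlation preservation. An attribute with a near-zero loading on the component would receive cut points clustered around its mean, so a fuller treatment would use more than one component.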

Index Terms: Data preprocessing, principal component analysis, data mining/summarization, missing data, data compression.
Sameep Mehta, Srinivasan Parthasarathy, Hui Yang, "Toward Unsupervised Correlation Preserving Discretization," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, pp. 1174-1185, Sept. 2005, doi:10.1109/TKDE.2005.153