Compression, Clustering, and Pattern Discovery in Very High-Dimensional Discrete-Attribute Data Sets
April 2005 (vol. 17 no. 4)
pp. 447-461
This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attribute data sets. Such data sets, which arise frequently in a wide variety of applications, pose some of the most significant challenges in data analysis. Subsampling and compression are two key technologies for analyzing these data sets. The proposed framework, PROXIMUS, reduces a large data set to a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy. We show desirable properties of PROXIMUS in terms of runtime, scalability to large data sets, ability to represent data in a compact form, and discovery and interpretation of interesting patterns. We also demonstrate sample applications of PROXIMUS in association rule mining and semantic classification of term-document matrices. Our experimental results on real data sets show that using the compressed data for association rule mining yields excellent precision and recall (above 90 percent) across a range of problem parameters while drastically reducing the time required for analysis. We also show excellent interpretability of the patterns discovered by PROXIMUS in the context of clustering and classification of terms and documents. In doing so, we establish PROXIMUS as a tool both for preprocessing data before applying computationally expensive algorithms and for directly extracting correlated patterns.
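
As a rough illustration of how precision and recall might be measured in the association rule setting described above, the following sketch compares rules mined from a compressed data set against rules mined from the original data. The rule representation (pairs of frozensets), the helper `rule`, the function name `rule_precision_recall`, and the toy rule sets are all assumptions made for illustration; the abstract does not specify how the authors computed these values.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical rule representation: (antecedent items, consequent items).
Rule = Tuple[FrozenSet[str], FrozenSet[str]]


def rule(antecedent: Iterable[str], consequent: Iterable[str]) -> Rule:
    """Build a hashable rule from two collections of item names."""
    return frozenset(antecedent), frozenset(consequent)


def rule_precision_recall(reference: Set[Rule], discovered: Set[Rule]) -> Tuple[float, float]:
    """Return (precision, recall) of `discovered` measured against `reference`.

    precision = fraction of discovered rules that also appear in the reference set
    recall    = fraction of reference rules that were also discovered
    An empty denominator is treated as a perfect score (an assumption).
    """
    common = reference & discovered
    precision = len(common) / len(discovered) if discovered else 1.0
    recall = len(common) / len(reference) if reference else 1.0
    return precision, recall


# Toy data (made up): rules mined from the original data set ...
original_rules = {
    rule({"a"}, {"b"}),
    rule({"b"}, {"c"}),
    rule({"a", "c"}, {"d"}),
    rule({"d"}, {"a"}),
}
# ... and rules mined from a compressed version of the same data set.
compressed_rules = {
    rule({"a"}, {"b"}),
    rule({"b"}, {"c"}),
    rule({"a", "c"}, {"d"}),
    rule({"e"}, {"a"}),
}

precision, recall = rule_precision_recall(original_rules, compressed_rules)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the four rules found on the compressed data also appear in the reference set, and three of the four reference rules are recovered, so both measures are 0.75. Exact set matching is the simplest possible criterion; a real evaluation might also compare support and confidence values.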

Index Terms:
Clustering, classification, association rules, data mining, sparse, structured and very large systems, singular value decomposition.
Mehmet Koyutürk, Ananth Grama, Naren Ramakrishnan, "Compression, Clustering, and Pattern Discovery in Very High-Dimensional Discrete-Attribute Data Sets," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 447-461, April 2005, doi:10.1109/TKDE.2005.55