Issue No.02 - Feb. (2013 vol.25)
pp: 387-401
Robson L.F. Cordeiro, University of São Paulo, São Carlos
Agma J.M. Traina, University of São Paulo, São Carlos
Christos Faloutsos, Carnegie Mellon University, Pittsburgh
Caetano Traina Jr., University of São Paulo, São Carlos
ABSTRACT
This paper proposes Halite, a novel, fast, and scalable clustering method that looks for clusters in subspaces of multidimensional data. Existing methods are typically superlinear in space or execution time. Halite's strengths are that it is fast and scalable, while still giving highly accurate results. Specifically, the main contributions of Halite are: 1) Scalability: it is linear or quasi-linear in time and space with respect to the data size and dimensionality, and to the dimensionality of the clusters' subspaces; 2) Usability: it is deterministic, robust to noise, does not take the number of clusters as an input parameter, and detects clusters in subspaces generated by the original axes or by their linear combinations, including space rotation; 3) Effectiveness: it is accurate, providing results of equal or better quality compared to top related works; and 4) Generality: it includes a soft clustering approach. Experiments were performed on synthetic data ranging from five to 30 axes and up to 1 million points. On average, Halite was at least 12 times faster than seven representative works, and always presented highly accurate results. On real data, Halite was at least 11 times faster than the others, with accuracy up to 35 percent higher. Finally, we report experiments in a real scenario where soft clustering is desirable.
INDEX TERMS
Shape, Correlation, Laplace equations, Convolution, Proposals, Accuracy, Complexity theory, data mining, Local-correlation clustering, moderate-to-high dimensional data
CITATION
Robson L.F. Cordeiro, Agma J.M. Traina, Christos Faloutsos, Caetano Traina Jr., "Halite: Fast and Scalable Multiresolution Local-Correlation Clustering", IEEE Transactions on Knowledge & Data Engineering, vol.25, no. 2, pp. 387-401, Feb. 2013, doi:10.1109/TKDE.2011.176