Density-Based Multiscale Data Condensation
June 2002 (vol. 24 no. 6)
pp. 734-747

A problem of growing interest in pattern recognition applied to data mining is that of selecting a small representative subset from a very large data set. In this article, a nonparametric data reduction scheme is suggested that attempts to represent the density underlying the data. The algorithm selects representative points in a multiscale fashion, which distinguishes it from existing density-based approaches. The accuracy of representation by the condensed set is measured in terms of the error between the density estimates of the original and reduced sets. Experimental studies on several real-life data sets show that the multiscale approach is superior to several related condensation methods in terms of both condensation ratio and estimation error. The condensed set was also shown experimentally to be effective for some important data mining tasks, such as classification, clustering, and rule generation, on large data sets. Moreover, the algorithm is found empirically to be efficient in terms of sample complexity.
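
The abstract does not spell out the selection steps, so the following is only a minimal sketch of one way a density-based, multiscale selection could work, under stated assumptions: the distance from a point to its k-th nearest neighbor is used as a local scale, the point with the smallest such distance (the densest region) is selected first, and every point within a multiple of that distance is then discarded. The function name `condense`, the parameters `k` and `radius_factor`, and their default values are illustrative choices, not taken from the article.

```python
import math


def condense(points, k=3, radius_factor=2.0):
    """Greedy density-based condensation (illustrative sketch only).

    points: list of equal-length tuples of floats.
    k: neighbor rank that sets the local scale (assumed default).
    radius_factor: multiple of the k-NN distance used for pruning (assumed).
    Returns a list of (point, local_radius) pairs.
    """
    remaining = list(points)
    selected = []
    while remaining:
        if len(remaining) == 1:
            # A lone point has no neighbors; keep it with a zero radius.
            selected.append((remaining[0], 0.0))
            break
        # Fall back to the farthest available neighbor when few points remain.
        k_eff = min(k, len(remaining) - 1)
        # Index 0 of each sorted list is the point's distance to itself (0.0),
        # so index k_eff is the distance to its k_eff-th nearest neighbor.
        radii = [
            sorted(math.dist(p, q) for q in remaining)[k_eff]
            for p in remaining
        ]
        # Smallest k-NN distance corresponds to the densest neighborhood.
        i = min(range(len(remaining)), key=radii.__getitem__)
        center, r = remaining[i], radii[i]
        selected.append((center, r))
        # Discard the center and everything inside the pruning disc.
        remaining = [
            q for q in remaining
            if math.dist(center, q) > radius_factor * r
        ]
    return selected


if __name__ == "__main__":
    # A dense 6x6 grid plus a sparse 3x3 grid, far apart.
    dense = [(0.1 * x, 0.1 * y) for x in range(6) for y in range(6)]
    sparse = [(5.0 + x, 5.0 + y) for x in range(3) for y in range(3)]
    condensed = condense(dense + sparse, k=3)
    print(len(dense + sparse), "points condensed to", len(condensed))
```

Because the pruning radius follows the local k-NN distance, dense regions are covered by small discs and sparse regions by large ones, which is one reading of "multiscale" here. The sketch recomputes all pairwise distances on every pass, so it is only suitable for small inputs; a spatial index would be needed for the large data sets the abstract targets.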

Index Terms:
Data mining, multiscale condensation, scalability, density estimation, convergence in probability, instance learning.
Pabitra Mitra, C.A. Murthy, Sankar K. Pal, "Density-Based Multiscale Data Condensation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 734-747, June 2002, doi:10.1109/TPAMI.2002.1008381