loading...
 This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Distributed Nearest Neighbor-Based Condensation of Very Large Data Sets
December 2007 (vol. 19 no. 12)
pp. 1593-1606
In this work, PFCNN, a distributed method for computing a consistent subset of very large data set for the nearest neighbor classification rule is presented. In order to cope with the communication overhead typical of distributed environments and to reduce memory requirements, different variants of the basic PFCNN method are introduced. An analysis of spatial cost, CPU cost, and communication overhead is accomplished for all the algorithms. Experimental results, performed on both synthetic and real very large data sets, revealed that these methods can be profitably applied to enormous collections of data. Indeed, they scale-up well and are efficient in memory consumption, confirming the theoretical analysis, and achieve noticeable data reduction and good classification accuracy. To the best of our knowledge, this is the first distributed algorithm for computing a training set consistent subset for the nearest neighbor rule.

[1] D.W. Aha, “Editorial,” Artificial Intelligence Rev., special issue on lazy learning, vol. 11, nos. 1-5, pp. 7-10, 1997.
[2] D.W. Aha, D. Kibler, and M.K. Albert, “Instance-Based Learning Algorithms,” Machine Learning, vol. 6, pp. 37-66, 1991.
[3] F. Angiulli, “Fast Condensed Nearest Neighbor Rule,” Proc. 22nd Int'l Conf. Machine Learning (ICML '05), 2005.
[4] E. Aplaydin, “Voting over Multiple Condensed Nearest Neighbors,” Artificial Intelligence Rev., vol. 11, pp. 115-132, 1997.
[5] S. Bay, “Combining Nearest Neighbor Classifiers through Multiple Feature Subsets,” Proc. 15th Int'l Conf. Machine Learning (ICML '98), 1998.
[6] S. Bay, “Nearest Neighbor Classification from Multiple Feature Sets,” Intelligent Data Analysis, vol. 3, pp. 191-209, 1999.
[7] A. Beygelzimer, S. Kakade, and J. Langford, “Cover Trees for Nearest Neighbor,” Proc. 23rd Int'l Conf. Machine Learning (ICML '06), 2006.
[8] H. Brighton and C. Mellish, “Advances in Instance Selection for Instance-Based Learning Algorithms,” Data Mining and Knowledge Discovery, vol. 6, no. 2, pp. 153-172, 2002.
[9] T.M. Cover and P.E. Hart, “Nearest Neighbor Pattern Classification,” IEEE Trans. Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
[10] B. Dasarathy, Nearest Neighbor (NN) Norms-NN Pattern Classification Techniques. IEEE CS Press, 1991.
[11] B. Dasarathy, “Minimal Consistent Subset (MCS) Identification for Optimal Nearest Neighbor Decision Systems Design,” IEEE Trans. Systems, Man, and Cybernetics, vol. 24, no. 3, pp. 511-517, 1994.
[12] B. Dasarathy, “Nearest Unlike Neighbor (NUN): An Aid to Decision Confidence Estimation,” Optical Eng., vol. 34, pp. 2785-2792, 1995.
[13] F.S. Devi and M.N. Murty, “An Incremental Prototype Set Building Technique,” Pattern Recognition, vol. 35, no. 2, pp. 505-513, 2002.
[14] L. Devroye, “On the Inequality of Cover and Hart in Nearest Neighbor Discrimination,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 3, pp. 75-78, 1981.
[15] I. Foster and C. Kesselman, The Grid2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2003.
[16] A.A. Freitas and S.H. Lavington, Mining Very Large Databases with Parallel Processing. Kluwer Academic Publishers, 1998.
[17] J. Fürnkranz, “Round Robin Classification,” J. Machine Learning Research, vol. 2, pp. 721-747, 2002.
[18] V. Gaede and O. Günther, “Multidimensional Access Methods,” ACM Computing Surveys, vol. 30, no. 2, pp. 170-231, 1998.
[19] W. Gates, “The Reduced Nearest Neighbor Rule,” IEEE Trans. Information Theory, vol. 18, no. 3, pp. 431-433, 1972.
[20] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard,” Parallel Computing, vol. 22, no. 6, pp. 789-828, Sept. 1996.
[21] J. Han, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2005.
[22] P.E. Hart, “The Condensed Nearest Neighbor Rule,” IEEE Trans. Information Theory, vol. 14, no. 3, pp. 515-516, 1968.
[23] B. Karaçali and H. Krim, “Fast Minimization of Structural Risk by Nearest Neighbor Rule,” IEEE Trans. Neural Networks, vol. 14, no. 1, pp. 127-134, 2002.
[24] N.T. Karonis, B. Toonen, and I. Foster, “MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface,” J.Parallel and Distributed Computing, vol. 63, no. 5, pp. 551-563, 2003.
[25] C.L. Liu and M. Nakagawa, “Evaluation of Prototype Learning Algorithms for Nearest-Neighbor Classifier in Application to Handwritten Character Recognition,” Pattern Recognition, vol. 34, no. 3, pp. 601-615, 2001.
[26] B.-L. Lu and M. Ito, “Task Decomposition and Module Combination Based on Class Relations: A Modular Neural Network for Pattern Classification,” IEEE Trans. Neural Networks, vol. 10, no. 5, pp. 1244-1256, 1999.
[27] C. Stanfill and D. Waltz, “Towards Memory-Based Reasoning,” Comm. ACM, vol. 29, pp. 1213-1228, 1994.
[28] C. Stone, “Consistent Nonparametric Regression,” Annals of Statistics, vol. 8, pp. 1348-1360, 1977.
[29] G. Toussaint, “Proximity Graphs for Nearest Neighbor Decision Rules: Recent Progress,” Proc. 34th Symp. Interface of Computing Science and Statistics, Apr. 2002.
[30] I.W. Tsang, J.T. Kwok, and P.-M. Cheung, “Core Vector Machines: Fast SVM Training on Very Large Data Sets,” J. Machine Learning Research, vol. 6, pp. 363-392, 2005.
[31] P. Viswanath, M.N. Murty, and S. Bhatnagar, “Fusion of Multiple Approximate Nearest Neighbor Classifiers for Fast and Efficient Classification,” Information Fusion, vol. 5, no. 4, pp. 239-250, 2004.
[32] I. Watson and F. Marir, “Case-Based Reasoning: A Review,” The Knowledge Eng. Rev., vol. 9, no. 4, 1994.
[33] G. Wilfong, “Nearest Neighbor Problems,” Int'l J. Computational Geometry and Applications, vol. 2, no. 4, pp. 383-416, 1992.
[34] D.R. Wilson and T.R. Martinez, “Reduction Techniques for Instance-Based Learning Algorithms,” Machine Learning, vol. 38, no. 3, pp. 257-286, 2000.
[35] H. Zhao and B.-L. Lu, “A Modular $k\hbox{-}{\rm Nearest}$ Neighbor Classification Method for Massively Parallel Text Categorization,” Proc. First Int'l Symp. Computational and Information Science (CIS '04), 2004.

Index Terms:
Distributed systems, Clustering, classification, and association rules, Data mining
Citation:
Fabrizio Angiulli, Gianluigi Folino, "Distributed Nearest Neighbor-Based Condensation of Very Large Data Sets," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 12, pp. 1593-1606, Aug. 2007, doi:10.1109/TKDE.2007.190665
Usage of this product signifies your acceptance of the Terms of Use.