Condensed Nearest Neighbor Data Domain Description
October 2007 (vol. 29, no. 10)
pp. 1746-1758
A simple yet effective unsupervised classification rule for discriminating between normal and abnormal data is to accept test objects whose nearest neighbor distances in a reference data set, assumed to model normal behavior, lie within a certain threshold. This work investigates the effect of using a subset of the original data set as the reference set of the classifier. To this end, the concept of a reference consistent subset is introduced, and it is shown that finding a reference consistent subset of minimum cardinality is intractable. The CNNDD algorithm is then described, which computes a reference consistent subset with only two passes over the reference set. Experimental results revealed the advantages of condensing the data set and confirmed the effectiveness of the proposed approach. A thorough comparison with related methods was carried out, pointing out the strengths and weaknesses of one-class nearest-neighbor-based training set consistent condensation.

References:
[1] F. Angiulli, “Fast Condensed Nearest Neighbor Rule,” Proc. 22nd Int'l Conf. Machine Learning (ICML), pp. 25-32, Aug. 2005.
[2] F. Angiulli and C. Pizzuti, “Fast Outlier Detection in High-Dimensional Spaces,” Proc. European Conf. Principles and Practice of Knowledge Discovery in Databases, pp. 15-26, 2002.
[3] M. Breunig, H.P. Kriegel, R. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers,” Proc. ACM Int'l Conf. Management of Data, 2000.
[4] V. Cerverón and F.J. Ferri, “Another Move toward the Minimum Consistent Subset: A Tabu Search Approach to the Condensed Nearest Neighbor Rule,” IEEE Trans. Systems, Man, and Cybernetics—Part B: Cybernetics, vol. 31, no. 3, pp. 408-413, 2001.
[5] C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[6] T.M. Cover and P.E. Hart, “Nearest Neighbor Pattern Classification,” IEEE Trans. Information Theory, vol. 13, pp. 21-27, 1967.
[7] B. Dasarathy, “Minimal Consistent Subset (MCS) Identification for Optimal Nearest Neighbor Decision Systems Design,” IEEE Trans. Systems, Man, and Cybernetics, vol. 24, no. 3, pp. 511-517, 1994.
[8] L. Devroye, “On the Inequality of Cover and Hart,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 3, pp. 75-78, 1981.
[9] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
[10] C.L. Blake, D.J. Newman, S. Hettich, and C.J. Merz, “UCI Repository of Machine Learning Databases,” 1998.
[11] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo, “A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data,” Applications of Data Mining in Computer Security, 2002.
[12] E. Fix and J. Hodges, “Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties,” Technical Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.
[13] S. Floyd and M. Warmuth, “Sample Compression, Learnability and the Vapnik-Chervonenkis Dimension,” Machine Learning, vol. 21, no. 3, pp. 269-304, 1995.
[14] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Co., 1979.
[15] P.E. Hart, “The Condensed Nearest Neighbor Rule,” IEEE Trans. Information Theory, vol. 14, pp. 515-516, 1968.
[16] D.S. Hochbaum and D.B. Shmoys, “A Best Possible Heuristic for the k-Center Problem,” Math. Operations Research, vol. 10, no. 2, pp. 180-184, 1985.
[17] E. Knorr and R. Ng, “Algorithms for Mining Distance-Based Outliers in Large Datasets,” Proc. Int'l Conf. Very Large Databases, pp. 392-403, 1998.
[18] N. Littlestone and M. Warmuth, “Relating Data Compression and Learnability,” technical report, Univ. of California, Santa Cruz, 1986.
[19] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. ACM Int'l Conf. Management of Data, pp. 427-438, 2000.
[20] G. Ritter and M.T. Gallegos, “Outliers in Statistical Pattern Recognition and an Application to Automatic Chromosome Classification,” Pattern Recognition Letters, vol. 18, pp. 525-539, Apr. 1997.
[21] B. Schölkopf, C. Burges, and V. Vapnik, “Extracting Support Data for a Given Task,” Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 251-256, 1995.
[22] B. Schölkopf, J. Platt, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson, “Estimating the Support of a High-Dimensional Distribution,” Technical Report 87, Microsoft Research, Redmond, Wash., 1999.
[23] C. Stone, “Consistent Nonparametric Regression,” Annals of Statistics, vol. 5, pp. 595-645, 1977.
[24] D. Tax and R. Duin, “Data Domain Description Using Support Vectors,” Proc. European Symp. Artificial Neural Networks, pp. 251-256, Apr. 1999.
[25] D. Tax and R. Duin, “Data Descriptions in Subspaces,” Proc. Int'l Conf. Pattern Recognition, pp. 672-675, 2000.
[26] D.M.J. Tax, “One-Class Classification,” PhD dissertation, Delft Univ. of Technology, June 2001.
[27] G. Toussaint, “Proximity Graphs for Nearest Neighbor Decision Rules: Recent Progress,” Technical Report SOCS-02.5, School of Computer Science, McGill Univ., Montreal, Québec, 2002.
[28] A. Ypma and R. Duin, “Support Objects for Domain Approximation,” Proc. Int'l Conf. Artificial Neural Networks (ICANN), 1998.

Index Terms:
classification, data domain description, data condensation, nearest neighbor rule, novelty detection
Citation:
Fabrizio Angiulli, "Condensed Nearest Neighbor Data Domain Description," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1746-1758, Oct. 2007, doi:10.1109/TPAMI.2007.1086