Condensed Nearest Neighbor Data Domain Description
October 2007 (vol. 29 no. 10)
pp. 1746-1758
A simple yet effective unsupervised classification rule for discriminating between normal and abnormal data is to accept test objects whose nearest neighbor distances in a reference data set, assumed to model normal behavior, lie within a certain threshold. This work investigates the effect of using a subset of the original data set as the reference set of the classifier. To this end, the concept of a reference consistent subset is introduced, and it is shown that finding a reference consistent subset of minimum cardinality is intractable. The CNNDD algorithm is then described, which computes a reference consistent subset with only two reference set passes. Experimental results reveal the advantages of condensing the data set and confirm the effectiveness of the proposed approach. A thorough comparison with related methods points out the strengths and weaknesses of one-class nearest-neighbor-based training set consistent condensation.
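The ideas in the abstract can be sketched in a few lines of Python. This is an illustrative simplification, not the paper's exact CNNDD algorithm: `nndd_accept` is the one-class nearest-neighbor rule (accept a test object if its nearest neighbor in the reference set is within a threshold), and `condense` is a greedy scan that keeps only training objects not already accepted by the subset built so far. Because adding a point can only enlarge the acceptance region, a single scan of this simplified rule already yields a reference consistent subset, i.e., one under which every original training object is accepted; the function and parameter names are assumptions for illustration.

```python
import math


def nndd_accept(x, reference, theta):
    """One-class NN rule: accept x as 'normal' iff its nearest
    neighbor in the reference set lies within threshold theta."""
    return min(math.dist(x, r) for r in reference) <= theta


def condense(train, theta):
    """Greedy condensation sketch (not the paper's exact CNNDD):
    scan the training set, appending each object that the current
    subset would reject.  Since the acceptance region only grows,
    the result is reference consistent: every training object is
    accepted when the subset is used as the reference set."""
    subset = [train[0]]
    for x in train[1:]:
        if not nndd_accept(x, subset, theta):
            subset.append(x)
    return subset


# Two tight clusters condense to one representative each:
train = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
subset = condense(train, theta=1.0)
print(subset)  # [(0.0, 0.0), (5.0, 5.0)]
```

Note the trade-off the paper studies: the condensed subset answers the same accept/reject queries over the training data with far fewer stored objects, at the cost of a possibly different decision boundary on unseen test objects.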

[1] F. Angiulli, “Fast Condensed Nearest Neighbor Rule,” Proc. 22nd Int'l Conf. Machine Learning, pp. 7-11, Aug. 2005.
[2] F. Angiulli and C. Pizzuti, “Fast Outlier Detection in High-Dimensional Spaces,” Proc. European Conf. Principles and Practice of Knowledge Discovery in Databases, pp. 15-26, 2002.
[3] M. Breunig, H.P. Kriegel, R. Ng, and J. Sander, “Lof: Identifying Density-Based Local Outliers,” Proc. ACM Int'l Conf. Management of Data, 2000.
[4] V. Cerverón and F.J. Ferri, “Another Move toward the Minimum Consistent Subset: A Tabu Search Approach to the Condensed Nearest Neighbor Rule,” IEEE Trans. Systems, Man, and Cybernetics —Part B: Cybernetics, vol. 31, no. 3, pp. 304-408, 2001.
[5] C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” 2001.
[6] T.M. Cover and P.E. Hart, “Nearest Neighbor Pattern Classification,” IEEE Trans. Information Theory, vol. 13, pp. 21-27, 1967.
[7] B. Dasarathy, “Minimal Consistent Subset (MCS) Identification for Optimal Nearest Neighbor Decision Systems Design,” IEEE Trans. Systems, Man, and Cybernetics, vol. 24, no. 3, pp. 511-517, 1994.
[8] L. Devroye, “On the Inequality of Cover and Hart,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 3, pp. 75-78, 1981.
[9] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
[10] C.L. Blake, D.J. Newman, S. Hettich, and C.J. Merz, “UCI Repository of Machine Learning Databases,” 1998.
[11] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo, “A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data,” Applications of Data Mining in Computer Security, 2002.
[12] E. Fix and J. Hodges, “Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties,” Technical Report 4, School of Aviation Medicine, US Air Force (USAF), Randolph Field, Texas, 1951.
[13] S. Floyd and M. Warmuth, “Sample Compression, Learnability and the Vapnik-Chervonenkis Dimension,” Machine Learning, vol. 21, no. 3, pp. 269-304, 1995.
[14] M.R. Garey and D.S. Johnson, Computers and Intractability. W.H. Freeman and Co., 1979.
[15] P.E. Hart, “The Condensed Nearest Neighbor Rule,” IEEE Trans. Information Theory, vol. 14, pp. 515-516, 1968.
[16] D.S. Hochbaum and D.B. Shmoys, “A Best Possible Heuristic for the k-Center Problem,” Math. Operations Research, vol. 10, no. 2, pp. 180-184, 1985.
[17] E. Knorr and R. Ng, “Algorithms for Mining Distance-Based Outliers in Large Datasets,” Proc. Int'l Conf. Very Large Databases, pp. 392-403, 1998.
[18] N. Littlestone and M. Warmuth, “Relating Data Compression and Learnability,” technical report, Univ. of California, Santa Cruz, 1986.
[19] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proc. ACM Int'l Conf. Management of Data, pp. 427-438, 2000.
[20] G. Ritter and M.T. Gallegos, “Outliers in Statistical Pattern Recognition and an Application to Automatic Chromosome Classification,” Pattern Recognition Letters, vol. 18, pp. 525-539, Apr. 1997.
[21] B. Schölkopf, C. Burges, and V. Vapnik, “Extracting Support Data for a Given Task,” Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 251-256, 1995.
[22] B. Schölkopf, J. Platt, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson, “Estimating the Support of a High-Dimensional Distribution,” Technical Report 87, Microsoft Research, Redmond, Wash., 1999.
[23] C. Stone, “Consistent Nonparametric Regression,” Annals of Statistics, vol. 8, pp. 1348-1360, 1977.
[24] D. Tax and R. Duin, “Data Domain Description Using Support Vectors,” Proc. European Symp. Artificial Neural Networks, pp. 251-256, Apr. 1999.
[25] D. Tax and R. Duin, “Data Descriptions in Subspaces,” Proc. Int'l Conf. Pattern Recognition, pp. 672-675, 2000.
[26] D.M.J. Tax, “One-Class Classification,” PhD dissertation, Delft Univ. of Technology, June 2001.
[27] G. Toussaint, “Proximity Graphs for Nearest Neighbor Decision Rules: Recent Progress,” Technical Report SOCS-02.5, School of Computer Science, McGill Univ., Montreal, Québec, 2002.
[28] A. Ypma and R. Duin, “Support Objects for Domain Approximation,” Proc. Int'l Conf. Artificial Neural Networks (ICANN), 1998.

Index Terms:
classification, data domain description, data condensation, nearest neighbor rule, novelty detection
Fabrizio Angiulli, "Condensed Nearest Neighbor Data Domain Description," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1746-1758, Oct. 2007, doi:10.1109/TPAMI.2007.1086