KDX: An Indexer for Support Vector Machines
June 2006 (vol. 18 no. 6)
pp. 748-763
Support Vector Machines (SVMs) have been adopted by many data mining and information-retrieval applications for learning a mining or query concept and then retrieving the "top-k" best matches to the concept. However, when the data set is large, naively scanning the entire data set to find the top matches is not scalable. In this work, we propose a kernel indexing strategy to substantially prune the search space and thus improve the performance of top-k queries. Our kernel indexer (KDX) takes advantage of the underlying geometric properties and quickly converges on an approximate set of top-k instances of interest. More importantly, once the kernel (e.g., Gaussian kernel) has been selected and the indexer has been constructed, the indexer can work with different kernel-parameter settings (e.g., γ and σ) without performance compromise. Through theoretical analysis and empirical studies on a wide variety of data sets, we demonstrate KDX to be very effective. An earlier version of this paper appeared in the 2005 SIAM International Conference on Data Mining [24]. This version differs from the earlier one in four ways: it provides a detailed cost analysis under different scenarios, designed to meet varying accuracy, speed, and space requirements; it develops an approach for insertion and deletion of instances; it presents the specific computations and the geometric properties used to perform them; and it provides detailed algorithms for each operation needed to create and use the index structure.
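For orientation, the naive baseline that the abstract calls unscalable can be sketched as follows. This is not the KDX algorithm, only a minimal illustration of a full scan with a Gaussian-kernel SVM. The function names, the toy model, and the choice to rank instances by largest decision value are all assumptions made for this example.

```python
import heapq
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, coeffs, bias, gamma):
    """SVM score f(x) = sum_i coeff_i * K(sv_i, x) + bias.

    Each coeff_i stands for alpha_i * y_i of a trained model
    (hypothetical values here; no training is performed).
    """
    total = sum(
        c * gaussian_kernel(sv, x, gamma)
        for sv, c in zip(support_vectors, coeffs)
    )
    return total + bias


def naive_top_k(data, support_vectors, coeffs, bias, gamma, k):
    """Score every instance and return the indices of the k highest scores.

    Cost grows as (number of instances) x (number of support vectors)
    kernel evaluations, which is the full scan an index tries to avoid.
    """
    scored = (
        (decision_value(x, support_vectors, coeffs, bias, gamma), i)
        for i, x in enumerate(data)
    )
    return [i for _, i in heapq.nlargest(k, scored)]


# Toy model: one positive and one negative support vector.
toy_support_vectors = [(0.0, 0.0), (5.0, 5.0)]
toy_coeffs = [1.0, -1.0]
toy_data = [(0.1, 0.0), (4.9, 5.0), (1.0, 1.0), (3.0, 3.0)]
top_two = naive_top_k(toy_data, toy_support_vectors, toy_coeffs, 0.0, 0.5, 2)
```

Every query touches every instance, and changing `gamma` forces all scores to be recomputed. This is the cost the abstract's indexing strategy is meant to reduce.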

[1] C.C. Aggarwal and P.S. Yu, “Outlier Detection for High Dimensional Data,” Proc. ACM SIGMOD Conf., 2001.
[2] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, “The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles,” Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 322-331, 1990.
[3] S. Berchtold, D. Keim, and H.P. Kriegel, “The X-Tree: An Index Structure for High-Dimensional Data,” Proc. 22nd Conf. Very Large Databases, pp. 28-39, 1996.
[4] C.L. Blake and C.J. Merz, UCI Repository of Machine Learning Databases, 1998.
[5] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, M. Ares Jr., and D. Haussler, Support Vector Machine Classification of Microarray Gene Expression Data, UCSC-CRL 99-09, Dept. of Computer Science, Univ. of California at Santa Cruz, 1999.
[6] C.J.C. Burges and B. Schölkopf, “Improving the Accuracy and Speed of Support Vector Machines,” Advances in Neural Information Processing Systems, M.C. Mozer, M.I. Jordan, and T. Petsche, eds., vol. 9, p. 375, The MIT Press, 1997.
[7] C.J.C. Burges, “Geometry and Invariance in Kernel Based Methods,” Advances in Kernel Methods, A.J. Smola, B. Schölkopf, C. Burges, eds., Cambridge, Mass.: MIT Press, 1998.
[8] E. Chang, K. Goh, G. Sychay, and G. Wu, “Content-Based Soft Annotation for Multimodal Image Retrieval Using Bayes Point Machines,” IEEE Trans. Circuits and Systems for Video Technology, special issue on conceptual and dynamical aspects of multimedia content description, vol. 13, no. 1, pp. 26-38, 2003.
[9] E. Chang and S. Tong, “Svm_Active— Support Vector Machine Active Learning for Image Retrieval,” Proc. Ninth ACM Int'l Conf. Multimedia, pp. 107-118, 2001.
[10] P. Ciaccia, M. Patella, and P. Zezula, “M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces,” Proc. 23rd Int'l Conf. Very Large Databases, pp. 426-435, 1997.
[11] R. Cooley, “Classification of News Stories Using Support Vector Machines,” Proc. 16th Int'l Joint Conf. Artificial Intelligence Text Mining Workshop, 1999.
[12] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[13] H. Drucker, D. Wu, and V. Vapnik, “Support Vector Machines for Spam Categorization,” IEEE Trans. Neural Networks, vol. 10, no. 5, pp. 1048-1054, 1999.
[14] T.S. Furey, N. Duffy, N. Cristianini, D. Bednarski, M. Schummer, and D. Haussler, “Support Vector Machine Classification and Validation of Cancer Tissue Samples Using Microarray Expression Data,” Bioinformatics, vol. 16, no. 10, pp. 906-914, 2000.
[15] A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hashing,” The VLDB J., pp. 518-529, 1999.
[16] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene Selection for Cancer Classification Using Support Vector Machines,” Machine Learning, vol. 46, nos. 1/3, pp. 389-422, Jan. 2002.
[17] T. Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features,” Proc. 10th European Conf. Machine Learning, C. Nédellec and C. Rouveirol, eds., pp. 137-142, 1998.
[18] N. Katayama and S. Satoh, “The SR-Tree: An Index Structure for High-Dimensional Nearest Neighbor Queries,” Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 369-380, 1997.
[19] H. Kim, P. Howland, and H. Park, “Dimension Reduction in Text Classification Using Support Vector Machines,” J. Machine Learning Research, to appear.
[20] C. Leslie, E. Eskin, and W.S. Noble, “The Spectrum Kernel: A String Kernel for SVM Protein Classification,” Proc. Pacific Symp. Biocomputing, R.B. Altman, A.K. Dunker, L. Hunter, K. Lauderdale, and T.E. Klein, eds., pp. 564-575, 2002.
[21] C. Li, E. Chang, H. Garcia-Molina, and G. Wiederhold, “Clindex: Approximate Similarity Queries in High-Dimensional Spaces,” IEEE Trans. Knowledge and Data Eng., vol. 14, no. 4, July/Aug. 2002.
[22] K.-I. Lin, H.V. Jagadish, and C. Faloutsos, “The TV-Tree: An Index Structure for High-Dimensional Data,” VLDB J.: Very Large Data Bases, vol. 3, no. 4, pp. 517-542, 1994.
[23] E. Osuna, R. Freund, and F. Girosi, “Training Support Vector Machines: An Application to Face Detection,” Proc. 1997 Conf. Computer Vision and Pattern Recognition (CVPR '97), pp. 130-138, 1997.
[24] N. Panda and E.Y. Chang, “Exploiting Geometry for Support Vector Machine Indexing,” Proc. SIAM Int'l Conf. Data Mining, 2005.
[25] P. Pavlidis, J. Weston, J. Cai, and W.N. Grundy, “Gene Functional Classification from Heterogeneous Data,” Proc. Fifth Ann. Int'l Conf. Computational Biology, pp. 249-255, 2001.
[26] B. Schölkopf, C. Burges, and V. Vapnik, “Extracting Support Data for a Given Task,” 1995.
[27] B. Schölkopf, R. Williamson, A. Smola, J. Shawe-Taylor, and J. Platt, Support Vector Method for Novelty Detection, pp. 582-588. MIT Press, 2000.
[28] S. Tong and D. Koller, “Support Vector Machine Active Learning with Applications to Text Classification,” Proc. 17th Int'l Conf. Machine Learning, P. Langley, ed., pp. 999-1006, 2000.
[29] V. Vapnik, The Nature of Statistical Learning Theory. Springer Verlag, 1995.
[30] R. Weber, H.-J. Schek, and S. Blott, “A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces,” Proc. 24th Int'l Conf. Very Large Data Bases, pp. 194-205, 1998.

Index Terms:
Support vector machine, indexing, top-k retrieval.
Navneet Panda, Edward Y. Chang, "KDX: An Indexer for Support Vector Machines," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 6, pp. 748-763, June 2006, doi:10.1109/TKDE.2006.101