Issue No.08 - Aug. (2012 vol.24)
pp: 1422-1434
Shuang-Hong Yang, Georgia Institute of Technology, Atlanta
Bao-Gang Hu, Chinese Academy of Sciences, Beijing
ABSTRACT
Feature selection is fundamental to knowledge discovery from massive amounts of high-dimensional data. In an effort to establish theoretical justification for feature selection algorithms, this paper presents a theoretically optimal criterion, namely, the discriminative optimal criterion (DoC) for feature selection. Compared with the existing representative optimal criterion (RoC), which retains maximum information for modeling the relationship between input and output variables, DoC is pragmatically advantageous because it attempts to directly maximize the classification accuracy and naturally reflects the Bayes error in the objective. To make DoC computationally tractable for practical tasks, we propose an algorithmic framework that selects a subset of features by minimizing the Bayes error rate estimated by a nonparametric estimator. A set of existing algorithms, as well as new ones, can be derived naturally from this framework. As an example, we show that the Relief algorithm greedily attempts to minimize the Bayes error estimated by the k-Nearest-Neighbor (kNN) method. This new interpretation explains why the family of margin-based feature selection algorithms works and also offers a principled way to establish new alternatives for performance improvement. In particular, by exploiting the proposed framework, we establish the Parzen-Relief (P-Relief) algorithm, based on the Parzen window estimator, and the MAP-Relief (M-Relief) algorithm, which integrates the label distribution into the max-margin objective to effectively handle imbalanced and multiclass data. Experiments on various benchmark data sets demonstrate the effectiveness of the proposed algorithms.
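For readers unfamiliar with Relief, the following is a minimal sketch of the classic nearest-hit/nearest-miss weight update that the abstract refers to. It is the textbook form of Relief, not the P-Relief or M-Relief algorithms proposed in the paper, whose details are not given on this page. The function name `relief_weights`, the range-scaled L1 distance, and the toy data are illustrative assumptions.

```python
def relief_weights(X, y):
    """Classic Relief-style feature weights (illustrative sketch only).

    X: list of numeric feature vectors, y: list of class labels.
    Assumption: differences are scaled by each feature's range and
    neighbors are found with an L1 distance over the scaled features.
    """
    n, d = len(X), len(X[0])

    # Range of each feature, used to scale differences into [0, 1].
    spans = []
    for j in range(d):
        col = [row[j] for row in X]
        span = max(col) - min(col)
        spans.append(span if span > 0 else 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    w = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue  # no same-class or other-class neighbor available
        near_hit = min(hits, key=lambda k: dist(X[i], X[k]))
        near_miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            # Reward features that separate the nearest miss and
            # penalize features that separate the nearest hit.
            w[j] += (diff(X[i], X[near_miss], j) - diff(X[i], X[near_hit], j)) / n
    return w


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
print(relief_weights(X, y))
```

On this toy data the informative feature receives the larger weight. The difference between nearest-miss and nearest-hit distances is the per-sample margin of a 1-NN classifier, which is the link to kNN-based error estimation described in the abstract.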
INDEX TERMS
Feature selection, discriminative optimal criterion, feature weighting.
CITATION
Shuang-Hong Yang, Bao-Gang Hu, "Discriminative Feature Selection by Nonparametric Bayes Error Minimization", IEEE Transactions on Knowledge & Data Engineering, vol.24, no. 8, pp. 1422-1434, Aug. 2012, doi:10.1109/TKDE.2011.92