Issue No.03 - March (2012 vol.24)
pp: 465-477
Kashif Javed, University of Engineering and Technology, Lahore
Haroon A. Babri, University of Engineering and Technology, Lahore
Mehreen Saeed, National University of Computer and Emerging Sciences, Lahore
ABSTRACT
Data and knowledge management systems employ feature selection algorithms to remove irrelevant, redundant, and noisy information from the data. There are two well-known approaches to feature selection: feature ranking (FR) and feature subset selection (FSS). In this paper, we propose a new FR algorithm, termed class-dependent density-based feature elimination (CDFE), for binary data sets. Our theoretical analysis shows that CDFE computes the weights used for feature ranking more efficiently than the mutual information measure, while the rankings obtained from the two criteria approximate each other. CDFE uses a filtrapper approach to select a final subset. For data sets having hundreds of thousands of features, feature selection with FR algorithms is simple and computationally efficient, but redundant information may not be removed. FSS algorithms, on the other hand, analyze the data for redundancies but may become computationally impractical on high-dimensional data sets. We address these problems by combining FR and FSS methods in a two-stage feature selection algorithm. When introduced as a preprocessing step to the FSS algorithms, CDFE not only presents them with a feature subset that is good in terms of classification but also relieves them of heavy computations. Two FSS algorithms are employed in the second stage to test the two-stage feature selection idea. We carry out experiments with two different classifiers (naive Bayes and kernel ridge regression) on three real-life data sets (NOVA, HIVA, and GINA) of the "Agnostic Learning versus Prior Knowledge" challenge. As a stand-alone method, CDFE shows up to about 92 percent reduction in the feature set size. When combined with the FSS algorithms in two stages, CDFE significantly improves their classification accuracy and exhibits up to 97 percent reduction in the feature set size. We also compared CDFE against the winning entries of the challenge and found that it outperforms the best results on NOVA and HIVA while obtaining third position on GINA.
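The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. This is not the CDFE algorithm, whose weight computation is not given here. The first stage ranks binary features by the mutual information measure that the abstract names as the comparison criterion, and the second stage is a simple greedy redundancy filter standing in for an FSS algorithm. The function names and the parameters keep_top and max_redundancy are hypothetical and chosen only for illustration.

```python
import numpy as np


def mutual_information_binary(x, y):
    """Mutual information, in bits, between two binary (0/1) vectors."""
    x = np.asarray(x).astype(int)
    y = np.asarray(y).astype(int)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_xy = np.mean((x == a) & (y == b))  # joint probability estimate
            p_x = np.mean(x == a)                # marginal of x
            p_y = np.mean(y == b)                # marginal of y
            if p_xy > 0:                         # 0 * log(0) is taken as 0
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi


def rank_features(X, y):
    """Stage 1 (FR): order features by decreasing relevance to the labels."""
    scores = np.array(
        [mutual_information_binary(X[:, j], y) for j in range(X.shape[1])]
    )
    order = np.argsort(-scores, kind="stable")
    return order, scores


def two_stage_select(X, y, keep_top, max_redundancy):
    """Stage 2 (stand-in for FSS): greedy redundancy filter on the top-ranked features.

    keep_top and max_redundancy are illustrative assumptions, not values
    taken from the paper.
    """
    order, _ = rank_features(X, y)
    selected = []
    for j in order[:keep_top]:
        # Keep a feature only if it shares little information with those already kept.
        if all(
            mutual_information_binary(X[:, j], X[:, k]) < max_redundancy
            for k in selected
        ):
            selected.append(int(j))
    return selected
```

Because ranking discards most features before the pairwise redundancy check runs, the quadratic cost of the second stage applies only to the keep_top survivors, which is the computational saving the abstract attributes to the two-stage design.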
INDEX TERMS
Feature ranking, binary data, feature subset selection, two-stage feature selection, classification.
CITATION
Kashif Javed, Haroon A. Babri, Mehreen Saeed, "Feature Selection Based on Class-Dependent Densities for High-Dimensional Binary Data", IEEE Transactions on Knowledge & Data Engineering, vol.24, no. 3, pp. 465-477, March 2012, doi:10.1109/TKDE.2010.263