Issue No.06 - June (2011 vol.33)
pp: 1217-1233
Qiang Cheng , Southern Illinois University Carbondale, Carbondale
Hongbo Zhou , Southern Illinois University Carbondale, Carbondale
Jie Cheng , University of Hawaii at Hilo
Selecting features for multiclass classification is a critically important task for pattern recognition and machine learning applications. Especially challenging is selecting an optimal subset of features from high-dimensional data, which typically have many more variables than observations and contain significant noise, missing components, or outliers. Existing methods either cannot handle high-dimensional data efficiently or scalably, or can obtain only a local optimum rather than the global optimum. To select the globally optimal subset of features efficiently, we introduce a new selector, which we call the Fisher-Markov selector, to identify those features that are the most useful in describing essential differences among the possible groups. In particular, in this paper we present a way to represent essential discriminating characteristics together with the sparsity as an optimization objective. With properly identified measures for the sparseness and discriminativeness in possibly high-dimensional settings, we take a systematic approach for optimizing the measures to choose the best feature subset. We use Markov random field optimization techniques to solve the formulated objective functions for simultaneous feature selection. Our results are noncombinatorial, and they can achieve the exact global optimum of the objective function for some special kernels. The method is fast; in particular, it can be linear in the number of features and quadratic in the number of observations. We apply our procedure to a variety of real-world data, including a mid-dimensional optical handwritten digit data set and high-dimensional microarray gene expression data sets. Experimental results confirm the effectiveness of our method.
In pattern recognition and from a model selection viewpoint, our procedure shows that it is possible to select the most discriminating subset of variables by solving a very simple unconstrained objective function, which in fact can be obtained with an explicit expression.
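To make the idea of a noncombinatorial, explicit selection rule concrete, the following is a minimal Python sketch, not the paper's exact objective. It assumes a linear, per-feature score (between-class scatter minus a multiple of total scatter) and keeps every feature whose score exceeds a threshold. The function name `select_features`, the parameters `gamma` and `beta`, and the synthetic data are all illustrative assumptions that do not come from the abstract.

```python
import numpy as np

def select_features(X, y, gamma=0.5, beta=0.0):
    """Score each feature independently and keep those above a threshold.

    Assumed (hypothetical) score for feature j:
        between-class scatter_j - gamma * total scatter_j
    A feature is kept when its score exceeds beta, so no search over
    feature subsets is needed and the cost is linear in the number
    of features.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape

    mean_all = X.mean(axis=0)
    # Total scatter of each feature around the overall mean.
    total = ((X - mean_all) ** 2).sum(axis=0) / n

    # Between-class scatter: class-size-weighted squared distance of
    # each class mean from the overall mean, per feature.
    between = np.zeros(p)
    for c in np.unique(y):
        Xc = X[y == c]
        between += Xc.shape[0] * (Xc.mean(axis=0) - mean_all) ** 2
    between /= n

    scores = between - gamma * total
    return scores > beta, scores

# Synthetic example: 3 classes, 5 features, only feature 0 is informative.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 5))
X[:, 0] += 3.0 * y  # shift feature 0 by class label

mask, scores = select_features(X, y)
print("selected features:", np.flatnonzero(mask))
```

Because each feature is thresholded on its own score, the selection decouples across features; in this sketch `gamma` trades discriminativeness against scatter and `beta` controls how sparse the selected subset is.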
Classification, feature subset selection, Fisher's linear discriminant analysis, high-dimensional data, kernel, Markov random field.
Qiang Cheng, Hongbo Zhou, Jie Cheng, "The Fisher-Markov Selector: Fast Selecting Maximally Separable Feature Subset for Multiclass Classification with Applications to High-Dimensional Data", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.33, no. 6, pp. 1217-1233, June 2011, doi:10.1109/TPAMI.2010.195