Issue No.09 - September (2010 vol.32)
pp: 1610-1626
Yijun Sun, University of Florida, Gainesville
Sinisa Todorovic, Oregon State University, Corvallis
Steve Goodison, M.D. Anderson Cancer Center-Orlando, Orlando
ABSTRACT
This paper considers feature selection for data classification in the presence of a huge number of irrelevant features. We propose a new feature-selection algorithm that addresses several major issues with prior work, including problems with algorithm implementation, computational complexity, and solution accuracy. The key idea is to decompose an arbitrarily complex nonlinear problem into a set of locally linear ones through local learning, and then learn feature relevance globally within the large margin framework. The proposed algorithm is based on well-established machine learning and numerical analysis techniques and makes no assumptions about the underlying data distribution. It can process many thousands of features within minutes on a personal computer while maintaining a very high accuracy that is nearly insensitive to a growing number of irrelevant features. Theoretical analysis suggests that the algorithm's sample complexity is logarithmic in the number of features. Experiments on 11 synthetic and real-world data sets demonstrate the viability of our formulation of the feature-selection problem for supervised learning and the effectiveness of our algorithm.
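The abstract states the key idea but not the update rules, so the following Python sketch is only one plausible reading of it, not the authors' implementation. It assumes a weighted L1 distance, an exponential kernel of width `sigma` to pick soft nearest neighbors, a logistic loss on the resulting local margins, an $\ell_1$ penalty of strength `lam`, and a plain projected gradient step of size `lr`; the function names and all parameter values are hypothetical.

```python
import numpy as np


def soft_nearest(dist, mask, sigma):
    """Kernel weights over the samples selected by `mask`, peaked at the nearest."""
    logits = np.where(mask, -dist / sigma, -np.inf)
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()


def local_margin_vectors(X, y, w, sigma=1.0):
    """Local (linear) step: per-sample, per-feature margin vectors.

    Row i is the kernel-weighted feature distance to other-class samples minus
    the kernel-weighted feature distance to same-class samples, with
    neighborhoods measured under the current feature weights w.
    Assumes every class has at least two samples.
    """
    n = X.shape[0]
    idx = np.arange(n)
    Z = np.empty_like(X, dtype=float)
    for i in range(n):
        diff = np.abs(X - X[i])            # per-feature L1 differences
        dist = diff @ w                    # weighted L1 distance to each sample
        hits = (y == y[i]) & (idx != i)    # same class, excluding the sample itself
        misses = y != y[i]                 # other classes
        Z[i] = (soft_nearest(dist, misses, sigma) @ diff
                - soft_nearest(dist, hits, sigma) @ diff)
    return Z


def fit_feature_weights(X, y, sigma=1.0, lam=0.05, outer=10, inner=200, lr=0.1):
    """Global step: nonnegative weights minimizing logistic loss plus l1 penalty.

    Assumption: projected gradient descent is used here for simplicity; the
    abstract does not say which solver the authors use.
    """
    n, d = X.shape
    w = np.ones(d)
    for _ in range(outer):
        Z = local_margin_vectors(X, y, w, sigma)   # re-linearize around current w
        for _ in range(inner):
            s = np.exp(-np.logaddexp(0.0, Z @ w))  # sigmoid(-margin), computed stably
            grad = -(Z.T @ s) / n + lam            # for w >= 0, ||w||_1 = sum(w)
            w = np.maximum(0.0, w - lr * grad)     # project onto w >= 0
    return w


# Synthetic demo (made-up data): feature 0 carries the class signal,
# the remaining 10 features are pure noise.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 60)
X_demo = rng.normal(size=(120, 11))
X_demo[:, 0] = 2.0 * (2 * y_demo - 1) + 0.5 * X_demo[:, 0]
w_demo = fit_feature_weights(X_demo, y_demo)
```

On this toy problem the learned weight for the first feature should dominate the noise features, which the $\ell_1$ penalty pushes toward zero. Because the margin vectors are recomputed from the current weights on each outer pass, the neighborhoods themselves adapt as irrelevant features are suppressed.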
INDEX TERMS
Feature selection, local learning, logistic regression, $\ell_1$ regularization, sample complexity.
CITATION
Yijun Sun, Sinisa Todorovic, Steve Goodison, "Local-Learning-Based Feature Selection for High-Dimensional Data Analysis," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 32, no. 9, pp. 1610-1626, September 2010, doi:10.1109/TPAMI.2009.190