Issue No. 03, March 2014 (vol. 26), pp. 698-710
Jialei Wang, School of Computer Engineering, Nanyang Technological University, Singapore
Peilin Zhao, School of Computer Engineering, Nanyang Technological University, Singapore
Steven C. H. Hoi, School of Computer Engineering, Nanyang Technological University, Singapore
Rong Jin, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
ABSTRACT
Feature selection is an important technique for data mining. Despite its importance, most studies of feature selection are restricted to batch learning. Unlike traditional batch learning methods, online learning represents a promising family of efficient and scalable machine learning algorithms for large-scale applications. Most existing studies of online learning require access to all the attributes/features of training instances. Such a classical setting is not always appropriate for real-world applications where data instances are of high dimensionality or it is expensive to acquire the full set of attributes/features. To address this limitation, we investigate the problem of online feature selection (OFS), in which an online learner is only allowed to maintain a classifier involving a small and fixed number of features. The key challenge of online feature selection is how to make accurate predictions for an instance using a small number of active features. This is in contrast to the classical setup of online learning, where all the features can be used for prediction. We attempt to tackle this challenge by studying sparsity regularization and truncation techniques. Specifically, this article addresses two different tasks of online feature selection: 1) learning with full input, where the learner is allowed to access all the features to decide the subset of active features, and 2) learning with partial input, where the learner is allowed to access only a limited number of features for each instance. We present novel algorithms to solve each of the two problems and give their performance analysis. We evaluate the performance of the proposed algorithms for online feature selection on several public data sets, and demonstrate their applications to real-world problems including image classification in computer vision and microarray gene expression analysis in bioinformatics.
The encouraging results of our experiments validate the efficacy and efficiency of the proposed techniques.
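The truncation idea the abstract describes (maintaining a classifier with a small, fixed number of active features) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' exact OFS algorithm: the function names, the learning rate `eta`, the hinge-loss update, and the feature budget `k` are all illustrative choices.

```python
import numpy as np

def truncate(w, k):
    """Zero out all but the k largest-magnitude weights, keeping the
    classifier restricted to at most k active features."""
    w = w.copy()
    idx = np.argsort(np.abs(w))[:-k]  # indices of all but the k largest magnitudes
    w[idx] = 0.0
    return w

def ofs_step(w, x, y, eta, k):
    """One online round: predict with the sparse classifier, perform a
    gradient-style update on a margin violation, then truncate back to
    the feature budget k. Labels y are assumed to be +1/-1."""
    if y * np.dot(w, x) < 1.0:   # hinge-loss margin violation
        w = w + eta * y * x      # perceptron-style gradient update
    return truncate(w, k)
```

For example, starting from a zero weight vector, a single update on a misclassified instance followed by truncation leaves at most `k` nonzero weights, corresponding to the `k` strongest features seen so far.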
INDEX TERMS
Classification algorithms, Training, Prediction algorithms, Data mining, Machine learning algorithms, Algorithm design and analysis, Bioinformatics, Big data analytics, Feature selection, Online learning, Large-scale data mining, Classification
CITATION
Jialei Wang, Peilin Zhao, Steven C. H. Hoi, Rong Jin, "Online Feature Selection and Its Applications", IEEE Transactions on Knowledge & Data Engineering, vol.26, no. 3, pp. 698-710, March 2014, doi:10.1109/TKDE.2013.32