Effective and Efficient Dimensionality Reduction for Large-Scale and Streaming Data Preprocessing
March 2006 (vol. 18 no. 3)
pp. 320-333
Dimensionality reduction is an essential data preprocessing technique for large-scale and streaming data classification tasks. It can be used to improve both the efficiency and the effectiveness of classifiers. Traditional dimensionality reduction approaches fall into two categories: feature extraction and feature selection. Techniques in the feature extraction category are typically more effective than those in the feature selection category; however, they may break down when processing large-scale data sets or data streams because of their high computational complexity. Feature selection approaches, in turn, mostly rely on greedy strategies and, hence, are not guaranteed to be optimal with respect to their optimization criteria. In this paper, we give an overview of popular feature extraction and feature selection algorithms under a unified framework. Moreover, we propose two novel dimensionality reduction algorithms based on the Orthogonal Centroid (OC) algorithm. The first is an Incremental OC (IOC) algorithm for feature extraction. The second is an Orthogonal Centroid Feature Selection (OCFS) method, which provides optimal solutions according to the OC criterion. Both are designed under the same optimization criterion. Experiments on the Reuters Corpus Volume-1 data set and other public large-scale text data sets indicate that the two algorithms compare favorably, in both effectiveness and efficiency, with other state-of-the-art algorithms.
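The OC criterion described above rewards features (or directions) along which class centroids lie far from the global centroid. As a rough illustration of the feature selection variant, the sketch below scores each feature by the class-size-weighted squared distance between each class centroid and the global centroid, then keeps the top-scoring features. This is a minimal reading of the OCFS idea, not the authors' implementation; the function names and the tiny toy data are our own.

```python
import numpy as np

def ocfs_scores(X, y):
    """Score each feature by the weighted squared distance between
    class centroids and the global centroid (an OC-style criterion).
    X: (n_samples, n_features) array; y: (n_samples,) class labels."""
    m = X.mean(axis=0)                       # global centroid
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                 # centroid of class c
        scores += len(Xc) * (mc - m) ** 2    # n_c * (m_c - m)^2, per feature
    return scores

def ocfs_select(X, y, p):
    """Return the indices of the p highest-scoring features."""
    return np.argsort(ocfs_scores(X, y))[::-1][:p]

# Toy example: feature 0 separates the two classes, feature 1 is noise.
X = np.array([[1.0, 0.0], [1.1, 0.1], [5.0, 0.05], [5.1, 0.0]])
y = np.array([0, 0, 1, 1])
selected = ocfs_select(X, y, 1)
```

Because each feature is scored independently in closed form, this selection needs only one pass over the class centroids, which is what makes an OC-based criterion attractive for large-scale and streaming text data compared with iterative wrapper-style selection.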

[1] M. Artae, M. Jogan, and A. Leonardis, “Incremental PCA for On-Line Visual Learning and Recognition,” Proc. 16th Int'l Conf. Pattern Recognition, pp. 781-784, 2002.
[2] A.L. Blum and P. Langley, “Selection of Relevant Features and Examples in Machine Learning,” Artificial Intelligence, vol. 97, nos. 1-2, pp. 245-271, 1997.
[3] M. Belkin and P. Niyogi, “Using Manifold Structure for Partially Labelled Classification,” Proc. Conf. Advances in Neural Information Processing, pp. 929-936, 2002.
[4] S.E. Brian and G. Dunn, Applied Multivariate Data Analysis. Edward Arnold, 2001.
[5] D. Koller and M. Sahami, “Toward Optimal Feature Selection,” Proc. 13th Int'l Conf. Machine Learning, pp. 284-292, 1996.
[6] W. Fan, M.D. Gordon, and P. Pathak, “Effective Profiling of Consumer Information Retrieval Needs: A Unified Framework and Empirical Comparison,” Decision Support Systems, vol. 40, pp. 213-233, 2004.
[7] J.E. Gentle, Numerical Linear Algebra for Applications in Statistics. Springer-Verlag, 1998.
[8] K. Hiraoka, K. Hidai, M. Hamahira, H. Mizoguchi, T. Mishima, and S. Yoshizawa, “Successive Learning of Linear Discriminant Analysis: Sanger-Type Algorithm,” Proc. 14th Int'l Conf. Pattern Recognition, pp. 2664-2667, 2000.
[9] R. Hoch, “Using IR Techniques for Text Classification in Document Analysis,” Proc. 17th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 31-40, 1994.
[10] P. Howland and H. Park, “Generalizing Discriminant Analysis Using the Generalized Singular Value Decomposition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, pp. 995-1006, 2004.
[11] M. Jeon, H. Park, and J.B. Rosen, “Dimension Reduction Based on Centroids and Least Squares for Efficient Processing of Text Data,” Technical Report MN TR 01-010, Univ. of Minnesota, Minneapolis, Feb. 2001.
[12] I.T. Jolliffe, Principal Component Analysis. Springer-Verlag, 1986.
[13] J.F. Hair, R.L. Tatham, R.E. Anderson, and W. Black, Multivariate Data Analysis, fifth ed. Prentice Hall, Mar. 1998.
[14] R. Kohavi and G. John, “Wrappers for Feature Subset Selection,” Artificial Intelligence, vol. 97, nos. 1-2, pp. 273-324, 1997.
[15] H.J. Kushner and D.S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer-Verlag, 1978.
[16] D. Lewis, Y. Yang, T. Rose, and F. Li, “RCV1: A New Benchmark Collection for Text Categorization Research,” J. Machine Learning Research, pp. 361-397, 2003.
[17] D.D. Lewis, “Feature Selection and Feature Extraction for Text Categorization,” Proc. Workshop Speech and Natural Language, pp. 212-217, 1992.
[18] H. Li, T. Jiang, and K. Zhang, “Efficient and Robust Feature Extraction by Maximum Margin Criterion,” Proc. Conf. Advances in Neural Information Processing Systems, pp. 97-104, 2004.
[19] Y. Li, L. Xu, J. Morphett, and R. Jacobs, “An Integrated Algorithm of Incremental and Robust PCA,” Proc. Int'l Conf. Image Processing, pp. 245-248, 2003.
[20] R.-L. Liu and Y.-L. Lu, “Incremental Context Mining for Adaptive Document Classification,” Proc. Eighth ACM Int'l Conf. Knowledge Discovery and Data Mining, pp. 599-604, 2002.
[21] A.M. Martinez and A.C. Kak, “PCA versus LDA,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, pp. 228-233, 2001.
[22] E. Oja, “Subspace Methods of Pattern Recognition,” Pattern Recognition and Image Processing Series, vol. 6, 1983.
[23] H. Park, M. Jeon, and J. Rosen, “Lower Dimensional Representation of Text Data Based on Centroids and Least Squares,” BIT Numerical Math., vol. 43, pp. 427-448, 2003.
[24] J. Platt, “Fast Training of Support Vector Machines Using Sequential Minimal Optimization,” Advances in Kernel Methods: Support Vector Learning, pp. 185-208, 1999.
[25] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley Longman, 1999.
[26] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. John Wiley, 2001.
[27] S.T. Roweis and L.K. Saul, “Nonlinear Dimensionality Reduction by Locally Linear Embedding,” Science, vol. 290, pp. 2323-2326, 2000.
[28] G. Salton and C. Buckley, “Term Weighting Approaches in Automatic Text Retrieval,” Information Processing and Management, vol. 24, pp. 513-523, 1988.
[29] M. Spitters, “Comparing Feature Sets for Learning Text Categorization,” Proc. Int'l Conf. Computer-Assisted Information Retrieval, pp. 233-251, 2000.
[30] J.B. Tenenbaum, V. de Silva, and J.C. Langford, “A Global Geometric Framework for Nonlinear Dimensionality Reduction,” Science, vol. 290, pp. 2319-2323, 2000.
[31] R.J. Vaccaro, SVD and Signal Processing II: Algorithms, Analysis and Applications. Elsevier Science, 1991.
[32] A.R. Webb, Statistical Pattern Recognition, second ed. John Wiley, 2002.
[33] J. Weng, Y. Zhang, and W.-S. Hwang, “Candid Covariance-Free Incremental Principal Component Analysis,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, pp. 1034-1040, 2003.
[34] J. Yan, B.Y. Zhang, S.C. Yan, Z. Chen, W.G. Fan, Q. Yang, W.Y. Ma, and Q.S. Cheng, “IMMC: Incremental Maximum Margin Criterion,” Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 725-730, 2004.
[35] J. Yan, N. Liu, B.Y. Zhang, S.C. Yan, Q.S. Cheng, W.G. Fan, Z. Chen, W.S. Xi, and W.Y. Ma, “OCFS: Orthogonal Centroid Feature Selection,” Proc. 28th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, 2005.
[36] Y. Yang and J.O. Pedersen, “A Comparative Study on Feature Selection in Text Categorization,” Proc. 14th Int'l Conf. Machine Learning, pp. 412-420, 1997.

Index Terms:
Feature extraction, feature selection, orthogonal centroid algorithm.
Jun Yan, Benyu Zhang, Ning Liu, Shuicheng Yan, Qiansheng Cheng, Weiguo Fan, Qiang Yang, Wensi Xi, Zheng Chen, "Effective and Efficient Dimensionality Reduction for Large-Scale and Streaming Data Preprocessing," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 3, pp. 320-333, March 2006, doi:10.1109/TKDE.2006.45