Issue No.07 - July (2011 vol.23)

pp: 1103-1117

Jun Yan, Microsoft Research Asia, Beijing, China

Ning Liu, Microsoft Research Asia, Beijing

Shuicheng Yan, National University of Singapore, Singapore

Qiang Yang, Hong Kong University of Science and Technology, Hong Kong

Weiguo (Patrick) Fan, Virginia Polytechnic Institute and State University, Blacksburg, VA

Wei Wei, Tsinghua University, Beijing

Zheng Chen, Microsoft Research Asia, Beijing

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.34

ABSTRACT

Dimension reduction for large-scale text data is attracting much attention due to the rapid growth of the World Wide Web. Popular dimension reduction algorithms fall into two groups: feature extraction and feature selection. In the former, new features are formed by combining the original features through algebraic transformation. Though many of these algorithms have been validated as effective, they typically carry high computational overhead, making them difficult to apply to real-world text data. In the latter, subsets of features are selected directly. These algorithms are widely used in real-world tasks owing to their efficiency, but they are often based on greedy strategies rather than optimal solutions. An important problem remains: it has been troublesome to integrate these two types of algorithms into a single framework, making it difficult to reap the benefits of both. In this paper, we formulate the two algorithm categories through a unified optimization framework, under which we develop a novel feature selection algorithm called Trace-Oriented Feature Analysis (TOFA). In detail, we integrate the objective functions of several state-of-the-art feature extraction algorithms into a unified one under the optimization framework, and then we propose to optimize this objective function in the solution space of feature selection algorithms for dimensionality reduction. Since the proposed objective function of TOFA integrates the objective functions of many prominent feature extraction algorithms, such as unsupervised Principal Component Analysis (PCA) and supervised Maximum Margin Criterion (MMC), TOFA can handle both supervised and unsupervised problems. In addition, by tuning a weight value, TOFA is also suitable for semisupervised learning problems. Experimental results on several real-world data sets validate the effectiveness and efficiency of TOFA for dimensionality reduction of text data.
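One way to read the idea of optimizing a feature extraction objective in the solution space of feature selection: if the projection matrix W is restricted to picking original features (each column a standard basis vector), then trace(W^T M W) equals the sum of the diagonal entries of M at the selected features, so maximizing the trace reduces to ranking features by those diagonal entries. The Python sketch below illustrates only that observation. It is not the paper's algorithm: the function names, the linear blend controlled by `weight`, and the use of dense lists instead of a sparse document-term matrix are all assumptions made for illustration.

```python
from collections import defaultdict


def column_means(rows):
    """Mean of each feature (column) over the given rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def variance_scores(rows):
    """Diagonal of the covariance matrix: the per-feature variance
    (the PCA-style, unsupervised criterion)."""
    mean = column_means(rows)
    n = len(rows)
    return [sum((r[j] - mean[j]) ** 2 for r in rows) / n
            for j in range(len(mean))]


def margin_scores(rows, labels):
    """Diagonal of S_b - S_w, between-class minus within-class scatter
    (the MMC-style, supervised criterion)."""
    mean = column_means(rows)
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    n = len(rows)
    scores = [0.0] * len(mean)
    for members in groups.values():
        prior = len(members) / n
        class_mean = column_means(members)
        within = variance_scores(members)
        for j in range(len(mean)):
            scores[j] += prior * ((class_mean[j] - mean[j]) ** 2 - within[j])
    return scores


def select_features(rows, k, labels=None, weight=1.0):
    """Return the indices of the k features with the largest score.

    Without labels the score is the variance. With labels the score is
    an assumed blend (1 - weight) * variance + weight * margin; this
    blend is a placeholder, not the weighting used in the paper.
    """
    scores = variance_scores(rows)
    if labels is not None:
        supervised = margin_scores(rows, labels)
        scores = [(1 - weight) * u + weight * s
                  for u, s in zip(scores, supervised)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy data: feature 0 is constant, feature 1 separates the two
    # classes, feature 2 varies only within each class.
    docs = [[1, 0, 5], [1, 0, 6], [1, 1, 5], [1, 1, 6]]
    classes = [0, 0, 1, 1]
    print(select_features(docs, 2))
    print(select_features(docs, 1, labels=classes, weight=1.0))
```

Because each score depends on a single feature, the ranking needs no eigendecomposition, which is consistent with the efficiency argument the abstract makes for feature selection over feature extraction.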

INDEX TERMS

Algebraic algorithms, computations on matrices, document analysis, global optimization.

CITATION

Jun Yan, Ning Liu, Shuicheng Yan, Qiang Yang, Weiguo (Patrick) Fan, Wei Wei, Zheng Chen, "Trace-Oriented Feature Analysis for Large-Scale Text Data Dimension Reduction",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 23, no. 7, pp. 1103-1117, July 2011, doi:10.1109/TKDE.2010.34