Mining Distinction and Commonality across Multiple Domains Using Generative Model for Text Classification
IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 11, pp. 2025-2039, Nov. 2012
Fuzhen Zhuang, The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Ping Luo, Hewlett Packard Labs, Beijing
Zhiyong Shen, Hewlett Packard Labs, Beijing
Qing He, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Yuhong Xiong, Lashou.com, China
Zhongzhi Shi, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Hui Xiong, Rutgers University, Newark
ABSTRACT
The distribution difference among multiple domains has been exploited for cross-domain text categorization in recent years. Along this line, we present two new observations in this study. First, the data distribution difference is often due to the fact that different domains use different index words to express the same concept. Second, the association between the conceptual feature and the document class can be stable across domains. These two observations indicate, respectively, the distinction and the commonality across domains. Inspired by these observations, we propose a generative statistical model, named Collaborative Dual-PLSA (CD-PLSA), to simultaneously capture both the domain distinction and the commonality among multiple domains. Different from Probabilistic Latent Semantic Analysis (PLSA), which has only one latent variable, the proposed model has two latent factors y and z, corresponding to the word concept and the document class, respectively. The shared commonality intertwines with the distinctions over multiple domains and also serves as the bridge for knowledge transfer. An Expectation Maximization (EM) algorithm is developed to solve the CD-PLSA model, and a distributed version is further developed to avoid uploading all the raw data to a centralized location, which helps mitigate privacy concerns. After the training phase over all the data from multiple domains, we propose to refine the intermediate outputs using only the corresponding local data. In summary, we propose a two-phase method for cross-domain text classification: the first phase performs collaborative training with all the data, and the second phase performs local refinement. Finally, we conduct extensive experiments over hundreds of classification tasks with multiple source domains and multiple target domains to validate the superiority of the proposed method over state-of-the-art supervised and transfer learning methods. Notably, the experimental results show that CD-PLSA for collaborative training is more tolerant of distribution differences, and that the local refinement yields significant improvement in classification accuracy.
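To make the model above concrete, the factorization below is one plausible reading of CD-PLSA based solely on this abstract; it is an assumption, not necessarily the paper's exact parameterization. Each domain s has its own conditionals P(w | y, s) and P(d | z, s) (the distinction), while the joint P(y, z) over word concepts y and document classes z is shared across domains (the commonality that bridges them):

P(w, d \mid s) = \sum_{y} \sum_{z} P(w \mid y, s) \, P(d \mid z, s) \, P(y, z)

P(y, z \mid w, d, s) \propto P(w \mid y, s) \, P(d \mid z, s) \, P(y, z) \quad \text{(E-step posterior)}

A minimal numpy sketch of one EM sweep under this assumed factorization follows; all names and shapes are hypothetical, and this is not the authors' code:

import numpy as np

def em_sweep(counts, Pw_y, Pd_z, Pyz):
    """One EM sweep of the assumed CD-PLSA-style factorization (illustrative only).

    counts[s] : (W, D_s) word-by-document count matrix of domain s
    Pw_y[s]   : (W, Y) domain-specific P(w | y, s)   -- distinction
    Pd_z[s]   : (D_s, Z) domain-specific P(d | z, s) -- distinction
    Pyz       : (Y, Z) shared joint P(y, z)          -- commonality
    """
    eps = 1e-12
    new_Pw_y = [np.zeros_like(m) for m in Pw_y]
    new_Pd_z = [np.zeros_like(m) for m in Pd_z]
    new_Pyz = np.zeros_like(Pyz)
    for s, X in enumerate(counts):
        # E-step: posterior P(y, z | w, d, s) as a dense (W, D_s, Y, Z) array,
        # written for clarity rather than memory efficiency.
        post = np.einsum('wy,dz,yz->wdyz', Pw_y[s], Pd_z[s], Pyz)
        post /= post.sum(axis=(2, 3), keepdims=True) + eps
        # M-step sufficient statistics, weighted by the observed counts.
        stat = post * X[:, :, None, None]
        new_Pw_y[s] = stat.sum(axis=(1, 3))   # (W, Y): sum over d and z
        new_Pd_z[s] = stat.sum(axis=(0, 2))   # (D_s, Z): sum over w and y
        new_Pyz += stat.sum(axis=(0, 1))      # (Y, Z): pooled over all domains
    # Normalization: each column of P(w|y,s) and P(d|z,s) sums to 1,
    # and the shared P(y,z) sums to 1 overall.
    new_Pw_y = [m / (m.sum(axis=0, keepdims=True) + eps) for m in new_Pw_y]
    new_Pd_z = [m / (m.sum(axis=0, keepdims=True) + eps) for m in new_Pd_z]
    new_Pyz /= new_Pyz.sum() + eps
    return new_Pw_y, new_Pd_z, new_Pyz

Note that the only statistic pooled across domains is the small (Y, Z) array new_Pyz, so each domain can compute its sufficient statistics locally and exchange only that aggregate, which is consistent with the abstract's claim that the distributed EM avoids uploading raw data to a centralized location. A local refinement phase could then rerun such updates per domain while holding the shared P(y, z) fixed, again as a sketch of our reading rather than the paper's exact procedure.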
INDEX TERMS
Data models, Training, Mathematical model, Collaboration, Joints, Graphical models, Companies, classification, Statistical generative models, cross-domain learning, distinction and commonality
CITATION
Fuzhen Zhuang, Ping Luo, Zhiyong Shen, Qing He, Yuhong Xiong, Zhongzhi Shi, Hui Xiong, "Mining Distinction and Commonality across Multiple Domains Using Generative Model for Text Classification", IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 11, pp. 2025-2039, Nov. 2012, doi:10.1109/TKDE.2011.143