Issue No. 05, May 2013 (vol. 35)
pp. 1025-1038
Jianhui Chen , GE Global Res., San Ramon, CA, USA
Lei Tang , Walmart Labs., San Bruno, CA, USA
Jun Liu , Siemens Corp. Res., Princeton, NJ, USA
Jieping Ye , Dept. of Comput. Sci. & Eng., Arizona State Univ., Tempe, AZ, USA
In this paper, we consider the problem of learning from multiple related tasks for improved generalization performance by extracting their shared structures. The alternating structure optimization (ASO) algorithm, which couples all tasks using a shared feature representation, has been successfully applied in various multitask learning problems. However, ASO is nonconvex and the alternating algorithm only finds a local solution. We first present an improved ASO formulation (iASO) for multitask learning based on a new regularizer. We then convert iASO, a nonconvex formulation, into a relaxed convex one (rASO). Interestingly, our theoretical analysis reveals that rASO finds a globally optimal solution to its nonconvex counterpart iASO under certain conditions. rASO can be equivalently reformulated as a semidefinite program (SDP), which is, however, not scalable to large datasets. We propose to employ the block coordinate descent (BCD) method and the accelerated projected gradient (APG) algorithm separately to find the globally optimal solution to rASO; we also develop efficient algorithms for solving the key subproblems involved in BCD and APG. The experiments on the Yahoo webpages datasets and the Drosophila gene expression pattern images datasets demonstrate the effectiveness and efficiency of the proposed algorithms and confirm our theoretical analysis.
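The abstract describes the core alternating scheme behind ASO: fix the shared low-dimensional structure and solve a regularized problem per task, then update the shared structure from the stacked task weights. A minimal, hypothetical sketch of that loop (not the authors' implementation; `aso_sketch`, its parameters, and the use of ridge regression for the per-task step are illustrative assumptions) looks like this:

```python
import numpy as np

def aso_sketch(Xs, ys, h=2, alpha=0.1, n_iter=20, seed=0):
    """Hypothetical sketch of ASO-style alternating optimization.

    Alternates between (1) a per-task ridge-regression step given the
    shared h x d subspace Theta, and (2) updating Theta from the top-h
    left singular vectors of the stacked task-weight matrix W.
    """
    rng = np.random.default_rng(seed)
    m = len(Xs)                 # number of tasks
    d = Xs[0].shape[1]          # shared feature dimension
    # Initialize Theta with orthonormal rows (Theta Theta^T = I).
    theta = np.linalg.qr(rng.standard_normal((d, h)))[0].T
    W = np.zeros((d, m))
    for _ in range(n_iter):
        # Step 1: per-task regularized least squares. The ASO-style
        # penalty charges only the component of w_t orthogonal to the
        # row space of Theta, i.e. alpha * ||(I - Theta^T Theta) w_t||^2.
        P = np.eye(d) - theta.T @ theta   # projector onto the complement
        for t, (X, y) in enumerate(zip(Xs, ys)):
            A = X.T @ X + alpha * P
            W[:, t] = np.linalg.solve(A, X.T @ y)
        # Step 2: shared-structure update. Minimizing the total penalty
        # over orthonormal Theta maximizes ||Theta W||_F^2, solved by the
        # top-h left singular vectors of W.
        U, _, _ = np.linalg.svd(W, full_matrices=False)
        theta = U[:, :h].T
    return W, theta
```

As the abstract notes, this alternating procedure is nonconvex and only reaches a local solution; the paper's contribution is a convex relaxation (rASO) whose global optimum can be found by BCD or APG, which this toy loop does not attempt to reproduce.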
Index Terms: Optimization, Algorithm design and analysis, Vectors, Fasteners, Complexity theory, Prediction algorithms, Acceleration. Author Keywords: accelerated projected gradient, multitask learning, shared predictive structure, alternating structure optimization.
Jianhui Chen, Lei Tang, Jun Liu, Jieping Ye, "A Convex Formulation for Learning a Shared Predictive Structure from Multiple Tasks", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.35, no. 5, pp. 1025-1038, May 2013, doi:10.1109/TPAMI.2012.189
[1] C.M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[2] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
[3] R. Caruana, "Multitask Learning," Machine Learning, vol. 28, no. 1, pp. 41-75, 1997.
[4] B. Heisele, T. Serre, M. Pontil, T. Vetter, and T. Poggio, "Categorization by Learning and Combining Object Parts," Proc. Advances in Neural Information Processing System, 2001.
[5] R.K. Ando and T. Zhang, "A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data," J. Machine Learning Research, vol. 6, pp. 1817-1853, 2005.
[6] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram, "Multi-Task Learning for Classification with Dirichlet Process Priors," J. Machine Learning Research, vol. 8, pp. 35-63, 2007.
[7] S. Yu, V. Tresp, and K. Yu, "Robust Multi-Task Learning with $t$ -Processes," Proc. Int'l Conf. Machine Learning, 2007.
[8] A. Argyriou, T. Evgeniou, and M. Pontil, "Convex Multi-Task Feature Learning," Machine Learning, vol. 73, no. 3, pp. 243-272, 2008.
[9] J. Zhou, J. Chen, and J. Ye, MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State Univ., 2012.
[10] R.K. Ando, "BioCreative II Gene Mention Tagging System at IBM Watson," Proc. Second BioCreative Challenge Evaluation Workshop, 2007.
[11] J. Bi, T. Xiong, S. Yu, M. Dundar, and R.B. Rao, "An Improved Multi-Task Learning Approach with Applications in Medical Diagnosis," Proc. European Conf. Machine Learning, 2008.
[12] O. Chapelle, P. Shivaswamy, S. Vadrevu, K. Weinberger, Y. Zhang, and B. Tseng, "Boosted Multi-Task Learning," Machine Learning, vol. 85, pp. 149-173, 2011.
[13] A. Quattoni, M. Collins, and T. Darrell, "Learning Visual Representations Using Images with Captions," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[14] J. Li, Y. Tian, T. Huang, and W. Gao, "Probabilistic Multi-Task Learning for Visual Saliency Estimation in Video," Int'l J. Computer Vision, vol. 90, no. 2, pp. 150-165, 2010.
[15] S. Thrun and J. O'Sullivan, "Discovering Structure in Multiple Learning Tasks: The TC Algorithm," Proc. Int'l Conf. Machine Learning, 1996.
[16] J. Baxter, "A Model of Inductive Bias Learning," J. Artificial Intelligence Research, vol. 12, pp. 149-198, 2000.
[17] B. Bakker and T. Heskes, "Task Clustering and Gating for Bayesian Multitask Learning," J. Machine Learning Research, vol. 4, pp. 83-99, 2003.
[18] N.D. Lawrence and J.C. Platt, "Learning to Learn with the Informative Vector Machine," Proc. Int'l Conf. Machine Learning, 2004.
[19] A. Schwaighofer, V. Tresp, and K. Yu, "Learning Gaussian Process Kernels via Hierarchical Bayes," Proc. Advances in Neural Information Processing Systems, 2004.
[20] K. Yu, V. Tresp, and A. Schwaighofer, "Learning Gaussian Processes from Multiple Tasks," Proc. Int'l Conf. Machine Learning, 2005.
[21] J. Zhang, Z. Ghahramani, and Y. Yang, "Learning Multiple Related Tasks Using Latent Independent Component Analysis," Proc. Advances in Neural Information Processing System, 2005.
[22] L. Jacob, F. Bach, and J.-P. Vert, "Clustered Multi-Task Learning: A Convex Formulation," Proc. Advances in Neural Information Processing System, 2008.
[23] T. Evgeniou and M. Pontil, "Regularized Multi-Task Learning," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2004.
[24] T. Evgeniou, C.A. Micchelli, and M. Pontil, "Learning Multiple Tasks with Kernel Methods," J. Machine Learning Research, vol. 6, pp. 615-637, 2005.
[25] T. Jebara, "Multi-Task Feature and Kernel Selection for SVMs," Proc. Int'l Conf. Machine Learning, 2004.
[26] G. Obozinski, B. Taskar, and M.I. Jordan, "Multi-Task Feature Selection," Proc. ICML Workshop Structural Knowledge Transfer for Machine Learning, 2006.
[27] A. Argyriou, T. Evgeniou, and M. Pontil, "Multi-Task Feature Learning," Proc. Advances in Neural Information Processing System, 2006.
[28] B. Recht, M. Fazel, and P.A. Parrilo, "Guaranteed Minimum Rank Solutions to Linear Matrix Equations via Nuclear Norm Minimization," SIAM Rev., vol. 52, no. 3, pp. 471-501, 2010.
[29] S. Ji and J. Ye, "An Accelerated Gradient Method for Trace Norm Minimization," Proc. Int'l Conf. Machine Learning, 2009.
[30] T.K. Pong, P. Tseng, S. Ji, and J. Ye, "Trace Norm Regularization: Reformulations, Algorithms, and Multi-Task Learning," SIAM J. Optimization, vol. 20, pp. 3465-3489, 2009.
[31] J. Zhou, J. Chen, and J. Ye, "Clustered Multi-Task Learning via Alternating Structure Optimization," Proc. Advances in Neural Information Processing System, 2011.
[32] D.P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[33] A. Nemirovski, Efficient Methods in Convex Programming. Springer, 1995.
[34] Y. Nesterov, Introductory Lectures on Convex Programming. Kluwer Academic Publishers, 1998.
[35] N. Ueda and K. Saito, "Parametric Mixture Models for Multi-Labeled Text," Proc. Advances in Neural Information Processing System, 2002.
[36] http:/, 2012.
[37] M.L. Overton and R.S. Womersley, "Optimality Conditions and Duality Theory for Minimizing Sums of the Largest Eigenvalues of Symmetric Matrices," Math. Programming, vol. 62, nos. 1-3, pp. 321-357, 1993.
[38] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge Univ. Press, 2004.
[39] A. Argyriou, C.A. Micchelli, M. Pontil, and Y. Ying, "A Spectral Regularization Framework for Multi-Task Structure Learning," Proc. Advances in Neural Information Processing System, 2007.
[40] G.H. Golub and C.F. Van Loan, Matrix Computations. Johns Hopkins Univ. Press, 1996.
[41] J.F. Sturm, "Using SeDuMi 1.02, a MATLAB Toolbox for Optimization over Symmetric Cones," Optimization Methods and Software, vols. 11/12, pp. 625-653, 1999.
[42] C.-C. Chang and C.-J. Lin, "LIBSVM: A Library for Support Vector Machines," ACM Trans. Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011.
[43] P. Brucker, "An O(n) Algorithm for Quadratic Knapsack Problems," Operations Research Letters, vol. 3, no. 3, pp. 163-166, 1984.
[44] J.-J. Moreau, "Proximité et Dualité dans un Espace Hilbertien," Bull. Soc. Math. France, vol. 93, pp. 273-299, 1965.
[45] J. Liu, S. Ji, and J. Ye, SLEP: Sparse Learning with Efficient Projections. Arizona State Univ., 2009.
[46] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, "Convex Optimization with Sparsity-Inducing Norms," Optimization for Machine Learning, S. Sra, S. Nowozin, and S.J. Wright, eds., MIT Press, 2011.
[47] J. Chen, J. Liu, and J. Ye, "Learning Incoherent Sparse and Low-Rank Patterns from Multiple Tasks," Proc. 16th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2010.
[48] D.D. Lewis, Y. Yang, T.G. Rose, and F. Li, "RCV1: A New Benchmark Collection for Text Categorization Research," J. Machine Learning Research, vol. 5, pp. 361-397, 2004.
[49] D.D. Lewis, "Evaluating Text Categorization," Proc. Speech and Natural Language Workshop, 1991.
[50] C.C. Fowlkes, C.L.L. Hendriks, S.V. Keränen, G.H. Weber, O. Rübel, M.-Y. Huang, S. Chatoor, A.H. DePace, L. Simirenko, C. Henriquez, A. Beaton, R. Weiszmann, S. Celniker, B. Hamann, D.W. Knowles, M.D. Biggin, M.B. Eisen, and J. Malik, "A Quantitative Spatiotemporal Atlas of Gene Expression in the Drosophila Blastoderm," Cell, vol. 133, no. 2, pp. 364-374, 2008.
[51] E. Lécuyer, H. Yoshida, N. Parthasarathy, C. Alm, T. Babak, T. Cerovina, T.R. Hughes, P. Tomancak, and H.M. Krause, "Global Analysis of mRNA Localization Reveals a Prominent Role in Organizing Cellular Architecture and Function," Cell, vol. 131, no. 1, pp. 174-187, 2007.
[52] P. Tomancak, A. Beaton, R. Weiszmann, E. Kwan, S. Shu, S.E. Lewis, S. Richards, M. Ashburner, V. Hartenstein, S.E. Celniker, and G.M. Rubin, "Systematic Determination of Patterns of Gene Expression during Drosophila Embryogenesis," Genome Biology, vol. 3, no. 12, 2002.
[53] S. Ji, L. Yuan, Y.-X. Li, Z.-H. Zhou, S. Kumar, and J. Ye, "Drosophila Gene Expression Pattern Annotation Using Sparse Features and Term-Term Interactions," Proc. 15th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2009.
[54] D.G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.