Issue No.08 - Aug. (2013 vol.35)
pp: 1944-1957
Brian Hutchinson, University of Washington, Seattle
Li Deng, Microsoft Research, Redmond
Dong Yu, Microsoft Research, Redmond
ABSTRACT
A novel deep architecture, the tensor deep stacking network (T-DSN), is presented. The T-DSN consists of multiple stacked blocks, where each block contains a bilinear mapping from two hidden layers to the output layer, using a weight tensor to incorporate higher order statistics of the hidden binary ($[0,1]$) features. A learning algorithm for the T-DSN's weight matrices and tensors is developed and described, in which the main parameter estimation burden is shifted to a convex subproblem with a closed-form solution. Using an efficient and scalable parallel implementation for CPU clusters, we train sets of T-DSNs on three popular tasks, in increasing order of data size: handwritten digit recognition using MNIST (60k samples), isolated state/phone classification and continuous phone recognition using TIMIT (1.1M), and isolated phone classification using WSJ0 (5.2M). Experimental results on all three tasks consistently demonstrate the effectiveness of the T-DSN and the associated learning methods. In particular, a sufficient depth of the T-DSN, symmetry in the structure of the two hidden layers in each T-DSN block, our model parameter learning algorithm, and a softmax layer on top of the T-DSN are all shown to contribute to the low error rates observed in the experiments on all three tasks.
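The bilinear mapping within a single block can be illustrated with a minimal sketch. It is not the authors' implementation: the sigmoid activation, the function names, and the toy weights `W1`, `W2`, and `T` are assumptions made for illustration only. The sketch computes the output from two hidden layers through a weight tensor, then shows that the same output is a linear function of the outer product of the two hidden vectors. That linearity is what allows the upper-layer weights to be estimated by a least-squares-style convex subproblem once the hidden activations are fixed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden(W, x):
    """Hidden layer with values in (0, 1); the sigmoid is an assumption here."""
    return [sigmoid(sum(w * xd for w, xd in zip(row, x))) for row in W]

def bilinear_output(T, h1, h2):
    """y[k] = sum over i, j of T[i][j][k] * h1[i] * h2[j]."""
    n_out = len(T[0][0])
    return [
        sum(T[i][j][k] * h1[i] * h2[j]
            for i in range(len(h1)) for j in range(len(h2)))
        for k in range(n_out)
    ]

def flattened_output(T, h1, h2):
    """Same prediction as a linear map applied to the outer product of h1 and h2."""
    outer = [a * b for a in h1 for b in h2]            # length L1 * L2
    U = [T[i][j] for i in range(len(h1)) for j in range(len(h2))]  # (L1*L2) x C
    n_out = len(U[0])
    return [sum(U[r][k] * outer[r] for r in range(len(outer)))
            for k in range(n_out)]

# Toy, hypothetical sizes: 2 inputs, hidden layers of 2 and 3 units, 2 outputs.
x = [0.5, -1.0]
W1 = [[0.1, 0.2], [-0.3, 0.4]]
W2 = [[0.5, -0.6], [0.7, 0.8], [-0.9, 1.0]]
T = [[[0.1 * (i + 1) - 0.05 * j + 0.2 * k for k in range(2)]
      for j in range(3)] for i in range(2)]

h1 = hidden(W1, x)
h2 = hidden(W2, x)
y_tensor = bilinear_output(T, h1, h2)
y_flat = flattened_output(T, h1, h2)
print(y_tensor)
print(y_flat)
```

The two printed vectors agree up to floating-point rounding. In a stacked model, one common arrangement (assumed here, not stated in the abstract) is to feed each block's output, together with the original input, into the next block.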
INDEX TERMS
Computer architecture, Training, Tensile stress, Vectors, Stacking, Machine learning, Closed-form solutions, WSJ, Deep learning, stacking networks, tensor, bilinear models, handwriting image classification, phone classification and recognition, MNIST, TIMIT
CITATION
Brian Hutchinson, Li Deng, Dong Yu, "Tensor Deep Stacking Networks", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.35, no. 8, pp. 1944-1957, Aug. 2013, doi:10.1109/TPAMI.2012.268