[1], [2]. It was called the deep convex network since learning the upper layer weights of each block could be formulated as solving a convex optimization problem with a closed-form solution, after having initialized the lower layer weights of each block with a fixed restricted Boltzmann machine. The network was later renamed the deep stacking network (DSN) [3], emphasizing that the mechanism in this network for establishing the deep architecture shares the same philosophy as "stacked generalization" [4]. This name also recognizes that the lower layer weights are, in practice, learned for greater effectiveness in classification tasks, so the overall weight learning problem in the DSN is no longer convex. In Section 2.1, we provide a short review of the previous DSN as the background for the current work.

[5], [6], [7]. One key distinction is the different domains in which the higher order structure is represented: one in the visible data, as in the mcRBM, and another in the hidden units, as in our T-DSN. In addition, the mcRBM can only be used as one single bottom layer in deep architectures and cannot be easily extended to deeper layers. This is due to the model and learning complexity incurred by the factorization required to reduce the cubic growth in the size of the weight parameters. Factorization incurs a very high computational cost which, together with the high cost of hybrid Monte Carlo in learning, makes it impossible to scale up to very large datasets. These difficulties are removed in the T-DSN proposed in this paper. Specifically, the interleaving of linear and nonlinear layers inherited from the DSN makes it straightforward to stack deeper layers, and the closed-form solution for the upper layer weights enables efficient, parallel training. Because of the relatively small sizes of the hidden layers, no factorization is needed for the T-DSN's tensor weights. The mcRBM and T-DSN differ in other ways as well; in particular, the mcRBM is a generative model optimizing a maximum likelihood objective, whereas the T-DSN is a discriminative model optimizing a least-squares objective. The preliminary work that introduced the T-DSN and its key advantages was described previously in [8]. This paper significantly expands that work and contains comprehensive experimental results plus details of the learning algorithm and its implementation.

[9], [10], [11], which have achieved great success in large vocabulary speech recognition (e.g., [10]). In [1], [2], and [3], it was shown that all computational steps of the learning algorithm for the DSN are batch-mode based, and are thus amenable to parallel implementation on a cluster of CPU (and/or GPU) nodes. The same computational advantage is retained for the T-DSN architecture introduced in this paper: We are able to parallelize all computations necessary for training and evaluation and thus scale our experiments to larger training sets using a cluster. The ability to continue to benefit from increasingly large batch sizes leads T-DSN training to use parallelism in a very different way from that of [12], which in contrast distributes asynchronous computation of mini-batch gradients for training a deep sparse autoencoder. Unlike the DNN and other deep architectures that demand GPUs for learning, all results presented in this paper are obtained using exclusively CPU-based clusters.

[1]. It is trained in a supervised, block-wise fashion, without the need for backpropagation over all blocks, as is common in other popular deep architectures [13]. The DSN blocks, each consisting of a simple, easy-to-learn module, are stacked to form the overall deep network.

[1] and [2], and which also forms the basis of the T-DSN, is a simplified multilayer perceptron with a single hidden layer. It consists of an upper layer weight matrix $U$ that connects the logistic sigmoidal nonlinear hidden layer $H$ to the linear output layer $Y$, and a lower layer weight matrix $W$ that links the input and hidden layers. Let the target vectors be arranged to form the columns of $T$, let the input data vectors be arranged to form the columns of $X$, let $H = \sigma(W^{\mathsf T} X)$ denote the matrix of hidden units, and assume the lower layer weights $W$ are known. The function $\sigma(\cdot)$ performs the element-wise logistic sigmoid operation $\sigma(a) = 1/(1 + e^{-a})$. Then, learning the upper layer weight matrix $U$ can be formulated as a convex optimization problem

$\min_{U} f(U) = \| U^{\mathsf T} H - T \|_F^2$   (1)

$U = (H H^{\mathsf T})^{-1} H T^{\mathsf T}$   (2)

[14] algorithm to minimize the squared error objective in (1). Embedding the solution of (2) into the objective and deriving the gradient, we obtain

$H^{\dagger} = H^{\mathsf T} (H H^{\mathsf T})^{-1}$   (3)

$\nabla_{W} f = 2 X \left[ H^{\mathsf T} \circ (\mathbf{1} - H)^{\mathsf T} \circ \left( H^{\dagger} (H T^{\mathsf T}) (T H^{\dagger}) - T^{\mathsf T} (T H^{\dagger}) \right) \right]$   (4)

where $\circ$ denotes element-wise multiplication.
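As a concrete check, the gradient of the objective with the closed-form upper layer solution embedded can be sketched in NumPy (our transcription, with `Hdag` standing for the pseudo-inverse $H^{\mathsf T}(HH^{\mathsf T})^{-1}$ and element-wise products written with `*`); it can be validated against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
D, L, C, N = 4, 6, 3, 30
X = rng.standard_normal((D, N))          # inputs as columns
T = rng.standard_normal((C, N))          # targets as columns

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def objective(W):
    # f(W) = ||U^T H - T||_F^2 with U set to its closed-form minimizer
    H = sigmoid(W.T @ X)
    U = np.linalg.solve(H @ H.T, H @ T.T)
    return np.sum((U.T @ H - T) ** 2)

def gradient(W):
    # 2 X [ H^T o (1 - H)^T o ( Hdag (H T^T)(T Hdag) - T^T (T Hdag) ) ]
    H = sigmoid(W.T @ X)
    Hdag = np.linalg.solve(H @ H.T, H).T       # H^T (H H^T)^{-1}, N x L
    THdag = T @ Hdag                           # C x L
    inner = Hdag @ (H @ T.T) @ THdag - T.T @ THdag
    return 2.0 * X @ ((H * (1.0 - H)).T * inner)

W0 = 0.1 * rng.standard_normal((D, L))
g = gradient(W0)
```

A central-difference check on individual entries of `g` is a cheap way to confirm the transcription before scaling up.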

[9]), the DSN does not aim to discover transformed feature representations. Owing to this restrictive way of building hierarchical structure, as well as to the simplicity of each block, the core of the DSN is considerably simplified, and optimizing the network weights is naturally parallelizable. It is noteworthy that, for purely discriminative tasks, experiments have shown that the DSN, despite its simplicity, performs better than the deep belief network [3].

[1]. Unlike the DSN, however, each block of the T-DSN has two sets of lower layer weight matrices, $W^{(1)}$ and $W^{(2)}$. They connect the input layer $X$ with two parallel branches of sigmoidal hidden layers, $H^{(1)}$ and $H^{(2)}$, shown in Fig. 1. Each T-DSN block also contains a three-way, upper layer weight tensor $U$ that connects the two branches of the hidden layer with the output layer.

*bilinearly* to produce the predictions. A map $f(h^{(1)}, h^{(2)})$ is bilinear if it is linear in $h^{(1)}$ for every fixed $h^{(2)}$ and linear in $h^{(2)}$ for every fixed $h^{(1)}$ [15]. We will see that the T-DSN generalizes the single block structure, replacing the linear map from hidden representation to output with a bilinear mapping while retaining its desirable modeling and estimation properties. That is, we have a generalization from the mapping $h \mapsto U^{\mathsf T} h$ in the DSN to the bilinear mapping $(h^{(1)}, h^{(2)}) \mapsto y$ in the T-DSN.

$y = U \times_1 h^{(1)} \times_2 h^{(2)}$   (5)

[16]. In more common notation:

$h^{(1)} = \sigma(W^{(1)\mathsf T} x)$   (6)

$h^{(2)} = \sigma(W^{(2)\mathsf T} x)$   (7)

$y_k = h^{(1)\mathsf T} U_k h^{(2)}$   (8)

where $U_k$ denotes the $k$th slice of the upper layer weight tensor $U$.
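Bilinearity of the per-block prediction is easy to verify numerically; in the sketch below (the dimensions and tensor values are hypothetical), `numpy.einsum` computes $y_k = h^{(1)\mathsf T} U_k h^{(2)}$ for every output unit $k$:

```python
import numpy as np

rng = np.random.default_rng(2)
L1, L2, C = 4, 5, 3
U = rng.standard_normal((L1, L2, C))   # three-way upper layer weight tensor
h1 = rng.standard_normal(L1)           # first hidden branch
h2 = rng.standard_normal(L2)           # second hidden branch

def predict(h1, h2):
    # y_k = h1^T U[:, :, k] h2  -- bilinear in (h1, h2)
    return np.einsum('i,ijk,j->k', h1, U, h2)

y = predict(h1, h2)
```

Fixing `h2` and varying `h1` (or vice versa) gives an ordinary linear map, which is exactly the property the closed-form upper layer estimation exploits.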

*unfolded* into a matrix. Aggregating the implicit hidden representations for each of the $N$ training data points into the columns of an $L_1 L_2 \times N$ matrix $\tilde{H}$, we obtain

$\tilde{H} = H^{(1)} \odot H^{(2)}$   (9)

where $\odot$ denotes the Khatri-Rao product [16], which performs a column-wise Kronecker product.
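The equivalence between the tensor form of the prediction and the unfolded, Khatri-Rao form can be sketched as follows (all dimensions are hypothetical; `scipy.linalg.khatri_rao` provides the same column-wise Kronecker product as the explicit loop here):

```python
import numpy as np

rng = np.random.default_rng(3)
L1, L2, C, N = 3, 4, 2, 10
U = rng.standard_normal((L1, L2, C))   # three-way weight tensor
H1 = rng.standard_normal((L1, N))      # first hidden branch, columns = samples
H2 = rng.standard_normal((L2, N))      # second hidden branch

# Khatri-Rao product: column-wise Kronecker product, (L1*L2) x N
Htilde = np.vstack([np.kron(H1[:, n], H2[:, n]) for n in range(N)]).T

# Unfold the tensor into an (L1*L2) x C matrix
Uunf = U.reshape(L1 * L2, C)

# Predictions computed two ways agree: bilinear tensor contraction
# versus a plain linear map on the Khatri-Rao features.
Y_tensor = np.einsum('in,ijk,jn->kn', H1, U, H2)
Y_matrix = Uunf.T @ Htilde
```

This identity is what reduces the upper layer tensor estimation to the same least-squares problem as in the DSN, only with $\tilde{H}$ in place of $H$.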

(10)

(11)

(12)

(13)

(14)

(15)

(16)

(17)

[17]. Our experience suggests that training a T-DSN block requires around 10-15 iterations of L-BFGS, with up to seven gradient evaluations per iteration for the line search. In our experiments, the weight matrices $W^{(1)}$ and $W^{(2)}$ are initialized with random values in a range that is tuned using the validation set.
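A minimal single-block training loop in this spirit can be sketched with SciPy's L-BFGS (this is our simplification, not the paper's parallel implementation: $W^{(1)}$ and $W^{(2)}$ are packed into one parameter vector, the upper layer weights are re-solved in closed form inside the objective, and gradients are left to finite differences):

```python
import numpy as np
from scipy.linalg import khatri_rao
from scipy.optimize import minimize

rng = np.random.default_rng(4)
D, L1, L2, C, N = 6, 4, 5, 3, 80
X = rng.standard_normal((D, N))
T = np.eye(C)[rng.integers(0, C, N)].T   # one-hot targets as columns

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def objective(w):
    W1 = w[:D * L1].reshape(D, L1)
    W2 = w[D * L1:].reshape(D, L2)
    H1, H2 = sigmoid(W1.T @ X), sigmoid(W2.T @ X)
    Ht = khatri_rao(H1, H2)              # (L1*L2) x N hidden features
    # Closed-form least squares for the unfolded upper layer tensor
    U = np.linalg.solve(Ht @ Ht.T + 1e-6 * np.eye(L1 * L2), Ht @ T.T)
    return np.sum((U.T @ Ht - T) ** 2)

w0 = 0.1 * rng.standard_normal(D * (L1 + L2))   # random init; range tuned in practice
res = minimize(objective, w0, method='L-BFGS-B', options={'maxiter': 15})
```

The cap of 15 iterations mirrors the 10-15 L-BFGS iterations per block reported above; an analytic gradient would replace the finite-difference one in any serious implementation.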

(18)

(19)

(20)

Table 1. Gradient Computational Complexity, Assuming the Earlier Expressions Are Cached for Use in the Later Expressions

(21)

(22)

(23)

Table 2. The Variables Computed in the Parallel Pipeline, Their Dimension, and the Mathematical Operations Required to Produce Them

**4.3.1 Parallel Timings**

Our original and primary motivation for the parallel implementation was to allow us to scale beyond the memory limit of a single machine. We also measure the effect of parallelization on speed. As usual, there is a cost associated with parallelization, namely, the interprocess communication time. Because our implementation uses network disk to store and load cached variables, this cost is nontrivial. Fig. 4 reports empirical wall-clock runtimes over repeated single instances of the parallel pipeline (i.e., each computing the gradient and evaluating the objective) on the TIMIT data (1.12M training samples). Specifically, the mean and the upper and lower 95 percent confidence intervals are plotted. To produce the timing results in this figure, we use hidden representations of a fixed dimension, and repeat a single instance of the parallel pipeline eight times for each number of machines across which the parallel training is distributed. On each machine, processing is parallelized over eight cores. A fixed, additional overhead associated with initializing the data is also included in the reported times; in practice, these overhead costs would become negligible over the course of training a multiple-block T-DSN. The minimum number of machines is four, since lower values caused the compute nodes' memory to be exceeded. For this dataset, the lowest average times are achieved with between 10 and 25 machines (80-200 cores). Beyond this, there is a gradual rise in the total computation time as improvements in computation time are outpaced by the additional disk access and communication costs. In practice, because the speedup is relatively insensitive to the degree of parallelization, we simply set the number of machines to be sufficiently large that training does not exceed the compute nodes' memory limits. Note that the stacking nature of the T-DSN means that one cannot parallelize over blocks, only within blocks. Because, in practice, the number of blocks is limited and the bulk of the computation is spent within blocks, this does not prove to be a major obstacle to training.
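The batch statistics behind the closed-form solution are plain sums over training samples, which is what makes the within-block cluster parallelization straightforward; this sketch (serial over shards, for illustration only) accumulates per-shard partial sums exactly as distributed nodes would:

```python
import numpy as np

rng = np.random.default_rng(5)
L, C, N = 6, 3, 1000
H = rng.standard_normal((L, N))     # hidden representations, columns = samples
T = rng.standard_normal((C, N))     # targets

# Each shard (node) computes H_s H_s^T and H_s T_s^T on its own slice of the
# data; the driver only has to sum the small partial results.
num_shards = 4
HHt = np.zeros((L, L))
HTt = np.zeros((L, C))
for s in range(num_shards):
    idx = slice(s * N // num_shards, (s + 1) * N // num_shards)
    HHt += H[:, idx] @ H[:, idx].T
    HTt += H[:, idx] @ T[:, idx].T

U_sharded = np.linalg.solve(HHt, HTt)
U_direct = np.linalg.solve(H @ H.T, H @ T.T)   # single-machine reference
```

The reduced matrices are only $L \times L$ and $L \times C$, so communication is cheap relative to the per-shard matrix products, regardless of $N$.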

[18]. The digits have been size-normalized while preserving their aspect ratio. Each original image is centered by computing and translating the center of mass of its pixels, yielding a 28 × 28 image. The task is to classify each image into one of the 10 digits. The MNIST training set is composed of 60,000 examples from approximately 250 writers. The test set is composed of 10,000 patterns. The sets of writers of the training set and test set are disjoint. In the experiments, a small fraction of the training data is held out as a validation set to tune the hyperparameters of the T-DSN. The properties of the validation and test sets in MNIST are found to be very similar to each other.
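The center-of-mass normalization described above can be sketched with SciPy (the toy "digit" below is hypothetical; bilinear interpolation, `order=1`, keeps pixel values non-negative under fractional shifts):

```python
import numpy as np
from scipy import ndimage

# A hypothetical off-center blob on a 28 x 28 canvas
img = np.zeros((28, 28))
img[2:9, 3:8] = 1.0

# Translate so that the pixel center of mass lands at the canvas center
cy, cx = ndimage.center_of_mass(img)
centered = ndimage.shift(img, (13.5 - cy, 13.5 - cx), order=1)
```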

Table 3. MNIST: Test-Set Error Rate at Convergence as a Function of Hidden-Layer Configuration

[19]. The use of distortions to augment the training data is important for achieving this lowest error rate. Without the distortions, whose use is impractical in real applications, the error rate increased to 0.53 percent. Without the use of convolutional structure and distortions or any other type of special preprocessing, the lowest error rate, 0.83 percent, was reported in [3] by a carefully tuned and optimized DSN. The error rate of 1.21 percent reported in this paper is comparable to that achieved using the deep belief network as reported in [9]. It was obtained without careful tuning, without passing the learned weights from one block to another, and without pretraining of the weights using restricted Boltzmann machines, all of which were exploited in [3].

[20]) and has not been customized for the T-DSN in this study. For the prediction at each layer of the T-DSN, we use 183 target class labels (i.e., three states for each of the 61 phones), which we call "phone states," with a zero-one (also called one-hot) encoding scheme. Phone boundaries are labeled in the corpus; we obtain the phone state labels through a state-frame alignment using a strong GMM-HMM system, as is common in recent deep learning work on speech recognition.
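The zero-one (one-hot) target encoding amounts to one column per frame with a single 1 in the row of that frame's phone state; a small sketch with hypothetical frame labels:

```python
import numpy as np

num_states = 183                    # 3 HMM states for each of the 61 phones
labels = np.array([0, 5, 182, 5])   # hypothetical frame-level phone state labels

# One column per frame, a single 1 at the row of the target state
T = np.zeros((num_states, labels.size))
T[labels, np.arange(labels.size)] = 1.0
```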

[20], [21] of the DNN, and show that the T-DSN and DSN are both superior to the DNN in both error rate and cross entropy.

Table 4. TIMIT: Comparing T-DSN (and DSN) (a) Before and (b) After Adding Softmax Layers to Produce Posterior Probabilities, in Terms of Frame-Level State Error Rate (F-SER) and CE Value

[22]. Using this new measure after collapsing, we observe a lower error rate using the T-DSN than the DSN and two versions of the DNN. When the outputs of the T-DSN are fed to a five-hidden-layer DNN, denoted "T-DSN + DNN" in Table 5, the framewise phone error rate drops further.

Table 5. TIMIT: Comparing Two Versions of T-DSN, DSN, and DNN in Terms of Frame-Level Phone Error Rate and of Continuous Phone Recognition Error Rate

[20]. The state-of-the-art TIMIT phone recognition error rate is 20.1 percent, reported and analyzed very recently in [23], lower than the 22.8 percent reported here. The differences are due to 1) the use of a specially designed convolutional structure, 2) the use of filterbank instead of MFCC features, and 3) very expensive optimization using backpropagation. The first two of these differences can be incorporated into the T-DSN architecture (as future work).

[25]. As suggested by the name, the 5k-WSJ0 database uses a 5,000 word vocabulary. The training material from the SI84 set (7,077 utterances, or 15.3 hours of speech from 84 speakers) in the database is separated into a 6,877-utterance training set and a 200-sentence validation set. Evaluation was carried out on the Nov92 evaluation data with 330 utterances from eight speakers. In this paper, we use the same MFCCs and their deltas as in the TIMIT experiments for the short-time spectral representation of the speech signal. With the 10 millisecond frame rate, this database yields over 5 million frames (i.e., samples) of training data (5,232,244 to be exact), substantially more than MNIST and TIMIT. Further, unlike the TIMIT database, where the phone boundaries in the training data are provided by human annotators, no phone boundaries are given in WSJ0. In this paper, we generate the phone labels and their boundaries in the training data from forced alignments using a tied-state crossword triphone Gaussian-mixture-HMM speech recognizer. Test-set labels are produced in the same way. These phone labels, 40 in total, together with their boundaries, provide a one-to-one mapping between each speech frame and its phone label as the target for training the T-DSN.

Table 6. WSJ: Frame-Level Phone Classification Error Rate Achieved with Only One Block and with Random Weight Matrices $W^{(1)}$ and $W^{(2)}$ in Fig. 1, as a Function of Hidden-Layer Configuration and of the Input Window Size

Table 7. WSJ: Frame-Level Phone Classification Error Rate, *After* Stacking Five Blocks and Training Weight Matrices $W^{(1)}$ and $W^{(2)}$ for Each Block, as a Function of Hidden-Layer Configuration and of the Input Window Size

[26] over the DNN) in its potential to explicitly incorporate speaker and/or environmental factors, by training one of the hidden representations to encode speaker or environmental information while effectively gating the other hidden-to-output mapping. Moreover, the T-DSN is equipped with a new stacking mechanism in which the more compact dual hidden representations can be concatenated with the input data when stacking the T-DSN blocks. The significantly smaller hidden representation size of the T-DSN relative to the DSN has the effect of bottlenecking the data, aiding "stackability" in the deep architecture by providing flexibility in stacking choices. One can concatenate the raw input data with $H^{(1)}$ and $H^{(2)}$ instead of the output $Y$, which may potentially be very large in some applications. The bottlenecking effect permits the T-DSN to pass more information between the blocks without dramatically increasing the input dimension of the higher level blocks.
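This stacking option can be sketched as a two-block toy (random, untrained lower layer weights; all sizes hypothetical): the second block receives the raw input concatenated with the first block's compact dual hidden representations rather than with its prediction:

```python
import numpy as np

rng = np.random.default_rng(6)
D, L1, L2, C, N = 5, 3, 4, 2, 50
X = rng.standard_normal((D, N))
T = np.eye(C)[rng.integers(0, C, N)].T

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tdsn_block(Xin, seed):
    # One T-DSN block with random lower layer weights (training omitted);
    # the upper layer weights are solved in closed form.
    r = np.random.default_rng(seed)
    W1 = 0.1 * r.standard_normal((Xin.shape[0], L1))
    W2 = 0.1 * r.standard_normal((Xin.shape[0], L2))
    H1, H2 = sigmoid(W1.T @ Xin), sigmoid(W2.T @ Xin)
    Ht = np.einsum('in,jn->ijn', H1, H2).reshape(L1 * L2, N)  # Khatri-Rao
    U = np.linalg.solve(Ht @ Ht.T + 1e-6 * np.eye(L1 * L2), Ht @ T.T)
    return H1, H2, U.T @ Ht

H1, H2, Y1 = tdsn_block(X, seed=0)
X2 = np.vstack([X, H1, H2])    # stack on the compact hidden codes, not on Y1
_, _, Y2 = tdsn_block(X2, seed=1)
```

Note that the second block's input grows only by $L_1 + L_2$ dimensions, independent of the output dimension, which is the bottlenecking effect described above.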

[27] and in speech attribute detection [28], we expect greater success with the use of the T-DSN in these and other applications.

# Acknowledgments

*B. Hutchinson is with the Department of Electrical Engineering, University of Washington, Seattle, WA 98105.*

*E-mail: brianhutchinson@ee.washington.edu.*

*L. Deng and D. Yu are with Microsoft Research, Redmond, WA 98052-6399. E-mail: {deng, dongyu}@microsoft.com.*

*Manuscript received 13 Apr. 2012; revised 22 Sept. 2012; accepted 30 Nov. 2012; published online 19 Dec. 2012.*

*Recommended for acceptance by M. Welling.*

*For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMI-2012-04-0290.*

*Digital Object Identifier no. 10.1109/TPAMI.2012.268.*

#### References

**Brian Hutchinson** received the BA degree in linguistics and the BS and MS degrees in computer science, all from Western Washington University, and the master's degree in electrical engineering from the University of Washington, where he is currently working toward the PhD degree. He worked at Microsoft Research, Redmond, as a PhD intern in 2010 and 2011. His research interests include speech and language processing, optimization, pattern classification, and machine learning. In particular, he is interested in matrix and tensor rank minimization in the context of language processing. He is a student member of the IEEE.

**Li Deng**received the PhD degree from the University of Wisconsin Madison. He joined the Department of Electrical and Computer Engineering, University of Waterloo, Ontario, Canada, in 1989, as an assistant professor, where he became a full professor with tenure in 1996. In 1999, he joined Microsoft Research (MSR), Redmond, Washington, as a senior researcher, where he is currently a principal researcher. Since 2000, he has also been an affiliate full professor and graduate committee member in the Department of Electrical Engineering at the University of Washington, Seattle. Prior to MSR, he also worked or taught at the Massachusetts Institute of Technology, ATR Interpreting Telecommunications Research Laboratories, Kyoto, Japan, and the Hong Kong University of Science and Technology. In broad areas of speech/language processing, signal processing, and machine learning, he has published more than 300 refereed papers in leading journals and conferences and three books. His recent technical work (since 2009) on industry-scale deep learning with colleagues and collaborators has created significant impact on speech recognition, signal processing, and related applications with high practical value. He has been granted more than 60 US or international patents. He served on the Board of Governors of the IEEE Signal Processing Society (2008-2010). More recently, he served as an editor-in-chief for the

*IEEE Signal Processing Magazine*(2009-2011), which ranked first in both years among all publications within the Electrical and Electronics Engineering Category worldwide in terms of its impact factor and for which he received the 2011 IEEE SPS Meritorious Service Award. He currently serves as editor-in-chief of the

*IEEE Transactions on Audio, Speech and Language Processing*. He is a fellow of the IEEE, the Acoustical Society of America, and ISCA.

**Dong Yu**received the BS degree (with honors) in electrical engineering from Zhejiang University, China, the MS degree in computer science from Indiana University at Bloomington, the MS degree in electrical engineering from the Chinese Academy of Sciences, and the PhD degree in computer science from the University of Idaho. He joined Microsoft Corporation in 1998 and Microsoft Speech Research Group in 2002, where he currently is a senior researcher. His current research interests include speech processing, robust speech recognition, discriminative training, spoken dialog system, machine learning, and pattern recognition. He has published more than 100 papers in these areas and is the inventor/co-inventor of close to 50 granted/pending patents. He is currently serving as an associate editor of the

*IEEE Transactions on Audio, Speech, and Language Processing*(2011-) and has served as an associate editor of the

*IEEE Signal Processing Magazine*(2008-2011), and as the lead guest editor of the

*IEEE Transactions on Audio, Speech, and Language Processing*special issue on deep learning for speech and language processing (2010-2011). He is a senior member of the IEEE.
