Pages: pp. 1944-1957

Abstract—A novel deep architecture, the tensor deep stacking network (T-DSN), is presented. The T-DSN consists of multiple, stacked blocks, where each block contains a bilinear mapping from two hidden layers to the output layer, using a weight tensor to incorporate higher order statistics of the hidden binary ( $[0,1]$ ) features. A learning algorithm for the T-DSN's weight matrices and tensors is developed and described in which the main parameter estimation burden is shifted to a convex subproblem with a closed-form solution. Using an efficient and scalable parallel implementation for CPU clusters, we train sets of T-DSNs in three popular tasks in increasing order of the data size: handwritten digit recognition using MNIST (60k), isolated state/phone classification and continuous phone recognition using TIMIT (1.1M), and isolated phone classification using WSJ0 (5.2M). Experimental results in all three tasks demonstrate the effectiveness of the T-DSN and the associated learning methods in a consistent manner. In particular, a sufficient depth of the T-DSN, a symmetry in the two hidden layers structure in each T-DSN block, our model parameter learning algorithm, and a softmax layer on top of T-DSN are shown to have all contributed to the low error rates observed in the experiments for all three tasks.

Keywords—Deep learning; stacking networks; tensor; bilinear models; handwriting image classification; phone classification and recognition; MNIST; TIMIT; WSJ

Recently, a deep classification architecture built upon blocks of simplified neural network modules, each with a single nonlinear hidden layer and linear input and output layers, was proposed, developed, and evaluated [ ^{1}], [ ^{2}]. It was called the deep convex network since learning the upper layer weights of each block could be formulated as solving a convex optimization problem with a closed-form solution, after having initialized the lower layer weights of each block with a fixed restricted Boltzmann machine. The network was later renamed the deep stacking network (DSN) [ ^{3}], emphasizing that the mechanism in this network for establishing the deep architecture shares the same philosophy as “stacked generalization” [ ^{4}]. This name also recognizes that the lower layer weights are, in practice, learned for greater effectiveness in classification tasks, so the overall weight learning problem in the DSN is no longer convex. In Section 2.1, we provide a short review of the previous DSN as the background for the current work.

The new deep architecture presented in this paper, which we call the tensor deep stacking network (T-DSN), improves and extends the earlier DSN architecture in two significant ways. First, information about higher order, covariance statistics in the data, which was not represented in DSN, is now embedded into the T-DSN via a bilinear mapping from two hidden representations to predictions using a third-order tensor. Second, while the T-DSN retains the same linear-nonlinear interleaving structure as DSN in building up the deep architecture, it shifts the major learning problem from the lower layer, nonconvex optimization component to the upper layer, convex subproblem with a closed-form solution.

Modeling covariance structure directly on raw speech or image data, rather than on the more compact binary ( $[0,1]$ ) hidden feature layers as achieved in T-DSN, was previously proposed in an architecture called mcRBM [ ^{5}], [ ^{6}], [ ^{7}]. One key distinction is the different domains in which the higher order structure is represented: one in the visible data, as in the mcRBM, and another in the hidden units, as in our T-DSN. In addition, mcRBM can only be used in one single bottom layer in deep architectures and cannot be easily extended to deeper layers. This is due to the model and learning complexity that are incurred by the factorization required to reduce the cubic growth in the size of the weight parameters. Factorization incurs very high computational cost which, together with the high cost of hybrid Monte Carlo in learning, makes it impossible to scale up to very large datasets. These difficulties are removed in the proposed T-DSN presented in this paper. Specifically, the same interleaving nature of linear and nonlinear layers inherited from DSN makes it straightforward to stack up deeper layers, and the closed-form solution for the upper layer weights enables efficient, parallel training. Because of the relatively small sizes in the hidden layers, no factorization is needed for the T-DSN's tensor weights. The mcRBM and T-DSN differ in other ways; in particular, the mcRBM is a generative model optimizing a maximum likelihood objective, while the T-DSN is a discriminative model optimizing a least-squares objective. The preliminary work that introduced the T-DSN and its key advantages was described previously in [ ^{8}]. This paper significantly expands the work and contains comprehensive experimental results plus details of the learning algorithm and its implementation.

One major motivation for developing the recent DSN is the lack of scalability and parallelization in the learning algorithms for deep neural networks (DNNs) [ ^{9}], [ ^{10}], [ ^{11}], which have achieved high success in large vocabulary speech recognition (e.g., [ ^{10}]). In [ ^{1}], [ ^{2}], and [ ^{3}], it was shown that all computational steps of the learning algorithm for DSN are batch-mode based, and are thus amenable to parallel implementation on a cluster of CPU (and/or GPU) nodes. The same computational advantage is retained for the T-DSN architecture introduced in this paper: We are able to parallelize all computations necessary for training and evaluation and thus scale our experiments to larger training sets using a cluster. The ability to continue to benefit from increasingly large batch sizes leads the T-DSN training to use parallelism in a very different way from that of [ ^{12}], which in contrast distributes asynchronous computation of mini-batch gradients for training a deep sparse autoencoder. Unlike the DNN and other deep architectures that demand GPUs in learning, all results presented in this paper are obtained using exclusively CPU-based clusters.

The organization of this paper is as follows: In Section 2, we provide a brief review of the DSN and discuss how the T-DSN generalizes the DSN. Section 2 also presents an overview of the T-DSN architecture and its bilinear structure that uses tensor weights to make predictions from two hidden representations of the input data. In Section 3, we describe a solution, including its derivation, to the T-DSN learning problem from the algorithmic perspective. A parallel implementation of the learning algorithm for practical applications is detailed in Section 4, which for comparison also includes a computational analysis of sequential learning time complexity. Section 5 presents three sets of evaluation experiments, from a smaller scale (MNIST with 60k training samples for image classification) to a larger scale (TIMIT with 1.12M training samples for both isolated phone classification and continuous phone recognition) and to a still larger scale (WSJ with 5.23M training samples for phone classification). We show the experimental results that consistently demonstrate the effectiveness of the T-DSN architecture and related learning methods.

In this section, we first briefly review the DSN as it relates to the T-DSN, and then describe the general architecture of the T-DSN and its key properties.

The DSN is a scalable deep architecture amenable to parallel weight learning [ ^{1}]. It is trained in a supervised, block-wise fashion, without the need for back propagation over all blocks, as is common in other popular deep architectures [ ^{13}]. The DSN blocks, each consisting of a simple, easy-to-learn module, are stacked to form the overall deep network.

Each DSN block, as developed in [ ^{1}] and [ ^{2}] and which also forms the basis of the T-DSN, is a simplified multilayer perceptron with a single hidden layer. It consists of an upper layer weight matrix ${\bf U}$ that connects the logistic sigmoidal nonlinear hidden layer ${\bf h}$ to the linear output layer ${\bf y}$ , and a lower layer weight matrix ${\bf W}$ that links the input and hidden layers. Let the target vectors ${\bf t}$ be arranged to form the columns of ${\bf T}$ , let the input data vectors $\bf x$ be arranged to form the columns of ${\bf X}$ , let ${\bf H} = \sigma ({\bf W}^T {\bf X})$ denote the matrix of hidden units, and assume the lower layer weights ${\bf W}$ are known. The function $\sigma$ performs the element-wise logistic sigmoid operation $\sigma (x) = 1/(1+\exp (-x))$ . Then, learning the upper layer weight matrix ${\bf U}$ can be formulated as a convex optimization problem

$$\mathop{\rm min}_{{\bf U}^T} f = \Vert {\bf U}^T {\bf H} - {\bf T} \Vert^2_F,$$(1)

which has a closed-form solution

$${\bf U}^T = {\bf T}{\bf H}^\dagger, \quad {\rm where } \quad {\bf H}^\dagger = {\bf H}^T ({\bf H}{\bf H}^T)^{-1}.$$(2)

At the bottom block, ${\bf X}$ contains only the raw input data, but for higher blocks of the DSN (as well as the T-DSN), the input data are concatenated with one or more output representations (typically ${\bf y}$ ) from the previous blocks. The lower layer weight matrix ${\bf W}$ in a DSN block can be optimized using an accelerated gradient descent [ ^{14}] algorithm to minimize the squared error objective in (1). Embedding the solution of (2) into the objective and deriving the gradient, we obtain

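$$\nabla_{\bf W} f = {\bf X} \big[ {\bf H}^T \circ \big({\bf 1} - {\bf H}^T\big) \circ {\bf \Theta } \big],$$(3)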

where ${\bf 1}$ is the matrix of all ones, $\circ$ denotes element-wise multiplication, and

$${\bf \Theta } = 2 \big[ {\bf H}^\dagger ({\bf HT}^T)({\bf TH}^\dagger ) - {\bf T}^T({\bf TH}^\dagger ) \big].$$(4)

To train a DSN block one iteratively updates ${\bf W}$ using the gradient in (3), which by design takes into consideration the optimal ${\bf U}$ ; after ${\bf W}$ has been estimated, the closed-form ${\bf U}$ is computed once, explicitly.
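The training recipe for a single DSN block can be illustrated with a small NumPy sketch. It computes the closed-form upper layer weights of (2) and a gradient with respect to ${\bf W}$ obtained by the chain rule through the sigmoid, then compares one gradient entry against a central finite difference. The toy dimensions, the random data, and the function name `dsn_block` are assumptions made for illustration; ${\bf \Theta }$ is written with the factor 2 multiplying both of its terms, which is the form the finite-difference check supports.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dsn_block(W, X, T):
    """Return objective f, closed-form U^T, and the gradient of f w.r.t. W."""
    H = sigmoid(W.T @ X)                       # L x N hidden units
    H_pinv = H.T @ np.linalg.inv(H @ H.T)      # N x L, pseudo-inverse as in (2)
    Ut = T @ H_pinv                            # C x L closed-form upper weights
    f = np.sum((Ut @ H - T) ** 2)              # squared error at the optimal U
    # Gradient of f with respect to H^T (N x L); factor 2 on both terms.
    Theta = 2.0 * (H_pinv @ (H @ T.T) @ Ut - T.T @ Ut)
    grad_W = X @ (H.T * (1.0 - H.T) * Theta)   # D x L, chain rule through sigmoid
    return f, Ut, grad_W

# Toy sizes (hypothetical): D inputs, L hidden units, C classes, N samples.
rng = np.random.default_rng(0)
D, L, C, N = 5, 4, 3, 50
X = rng.standard_normal((D, N))
T = rng.standard_normal((C, N))
W = rng.standard_normal((D, L))

f, Ut, grad_W = dsn_block(W, X, T)

# Central finite difference on one entry of W as a sanity check.
eps = 1e-5
E = np.zeros_like(W)
E[1, 2] = eps
fd = (dsn_block(W + E, X, T)[0] - dsn_block(W - E, X, T)[0]) / (2 * eps)
print(grad_W[1, 2], fd)
```

In a full training loop, `grad_W` would be passed to a first-order optimizer, and `Ut` would only need to be kept from the final iteration.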

It is to be emphasized that a key element of the standard DSN is that each block outputs an estimate of the final label class (expressed as the vector ${\bf y}$ ) and this estimate is concatenated with the original input vector to form the expanded “input” vector for the next block of the DSN. Because the original input is retained for each higher block, it is guaranteed to perform better on the training set than the previous block. In contrast to other deep architectures (e.g., the deep belief network [ ^{9}]), the DSN does not aim to discover transformed feature representations. Due to this restrictive nature of building hierarchical structures as well as to the simplicity of each block, the core of the DSN is considerably simplified and optimizing network weights is naturally parallelizable. It is noteworthy that for purely discriminative tasks experiments have shown that DSN, despite its simplicity, performs better than the deep belief network [ ^{3}].

It is also to be clarified that the estimate of the label class, vector ${\bf y}$ , is continuous valued, computed as a linear combination of the hidden layer in each block. The classification decision is made at the top block where the index of the maximal value in vector ${\bf y}$ determines the label. At the lower blocks of the DSN, the output vector ${\bf y}$ is not used for decision but is used to concatenate with the original input vector to feed to its immediately upper block. All these attributes have been inherited by the T-DSN to be presented next.
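The stacking mechanism described above can be sketched as follows. To keep the example short, the lower layer weights of each block are drawn at random rather than learned, which is a simplifying assumption and not the training procedure of the paper; the helper names (`fit_block`, `fit_stack`, `predict`) and the toy data are likewise hypothetical. The sketch shows the two structural points: each higher block receives the raw input concatenated with the previous block's continuous-valued output ${\bf y}$, and the class decision is taken only at the top block.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_block(X_in, T, L, rng):
    """One block: random lower weights (assumption), closed-form upper weights."""
    W = rng.standard_normal((X_in.shape[0], L))
    H = sigmoid(W.T @ X_in)
    Ut = T @ H.T @ np.linalg.inv(H @ H.T)
    return W, Ut

def block_output(X_in, W, Ut):
    return Ut @ sigmoid(W.T @ X_in)            # continuous-valued y, C x N

def fit_stack(X, T, L, n_blocks, seed=0):
    rng = np.random.default_rng(seed)
    blocks, X_in = [], X
    for _ in range(n_blocks):
        W, Ut = fit_block(X_in, T, L, rng)
        blocks.append((W, Ut))
        Y = block_output(X_in, W, Ut)
        X_in = np.vstack([X, Y])               # raw input + previous prediction
    return blocks

def predict(blocks, X):
    X_in = X
    for W, Ut in blocks:
        Y = block_output(X_in, W, Ut)
        X_in = np.vstack([X, Y])
    return np.argmax(Y, axis=0)                # decision at the top block only

rng = np.random.default_rng(1)
D, C, N = 6, 3, 200
X = rng.standard_normal((D, N))
labels = rng.integers(0, C, size=N)
T = np.eye(C)[:, labels]                       # one-hot targets as columns
blocks = fit_stack(X, T, L=8, n_blocks=3)
pred = predict(blocks, X)
```

Note that the input dimension of every block above the bottom one is `D + C`, since the $C$ outputs of the previous block are appended to the $D$ raw inputs.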

The DSN just reviewed is a special case of the T-DSN we describe now. In Fig. 1, we illustrate the modular architecture of a T-DSN, where three complete blocks are stacked one on another. The stacking operation of the T-DSN is exactly the same as that for the DSN described in [ ^{1}]. Unlike the DSN, however, each block of the T-DSN has two sets of lower layer weight matrices ${\bf W}_{(1)}$ and ${\bf W}_{(2)}$ . They connect the input layer ${\bf X}$ with two parallel branches of sigmoidal hidden layers ${\bf H}_{(1)}$ and ${\bf H}_{(2)}$ , shown in Fig. 1. Each T-DSN block also contains a three-way, upper layer weight tensor ${\cal U}$ that connects the two branches of the hidden layer with the output layer.

Fig. 1. An example T-DSN architecture with three stacking blocks, where each block consists of three layers and superscript is used to indicate the block number. Inputs ( ${\bf X}$ ) and outputs ( ${\bf Y}^{(i-1)}$ ) are concatenated to link two adjacent blocks. The hidden layer in each block has two parallel branches ( ${\bf H}_{(1)}^{(i)}$ and ${\bf H}_{(2)}^{(i)}$ ).

Note that if the T-DSN is used for regression or for classification, then the basic architecture shown in Fig. 1 is sufficient. However, if the T-DSN is to be interfaced with a hidden Markov model (HMM) for structured prediction, such as continuous phonetic or word recognition, it is desirable to convert the final output in Fig. 1 into posterior probabilities via an additional softmax layer (the softmax operation applied to a vector exponentiates each entry and then normalizes the vector's entries to sum to one, yielding a distribution). One set of the experiments reported in Section 5.2 is obtained with an additional softmax layer added to the top of Fig. 1.
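The softmax operation described in parentheses above can be written in a few lines. This is a minimal sketch; subtracting the maximum entry before exponentiating is a standard numerical safeguard that is not mentioned in the text and does not change the result.

```python
import numpy as np

def softmax(y):
    """Exponentiate each entry and normalize so the entries sum to one."""
    z = np.exp(y - np.max(y))   # shift by max(y) for numerical stability
    return z / np.sum(z)

y = np.array([0.1, 1.2, -0.3])  # hypothetical top-block output
p = softmax(y)
print(p, p.sum())
```

Because the exponential is monotonic, the index of the largest entry of `p` is the same as that of `y`, so adding the softmax layer does not by itself change the classification decision.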

Whereas each block of the DSN produces a single hidden representation of the data and linearly maps from the hidden representation to predictions, each block of the T-DSN uses two hidden representations and combines them *bilinearly* to produce the predictions. A map ${\cal F}(u,v)$ is bilinear if it is linear in $u$ for every fixed $v$ and linear in $v$ for every fixed $u$ [ ^{15}]. We will see that the T-DSN generalizes the single block structure, replacing a linear map from hidden representation to output with a bilinear mapping while retaining its desirable modeling and estimation properties. That is, we have a generalization from the mapping of ${\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L} \rightarrow {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^C$ in the DSN to the mapping of ${\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_1} \times {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_2} \rightarrow {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^C$ in the T-DSN.

As illustrated in Fig. 1, the first step in the T-DSN operation is to map an input data vector ${\bf x} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^D$ to two parallel branches of hidden representations, ${\bf h}_{(1)} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_1}$ and ${\bf h}_{(2)} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_2}$ . Conceptually, these represent two different views of the data (see Appendix C for an analysis of these views). Each hidden representation is obtained nonlinearly from the input data according to ${\bf h}_{(j)} = \sigma ( {\bf W}_{(j)}^T {\bf x})$ , where ${\bf W}_{(1)} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{D \times L_1}$ and ${\bf W}_{(2)} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{D \times L_2}$ are two weight matrices to be estimated. The interactions between these two hidden-layer branches and the prediction vector ${\bf y}$ are modeled by a third-order weight tensor, ${\bf {\cal U}} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_1 \times L_2 \times C}$ . Specifically, in the second step, the two hidden representations are bilinearly mapped to the prediction vector via ${\bf {\cal U}}$ . In tensor notation, the operation is

$${\cal U}( {\bf h}_{(1)}, {\bf h}_{(2)} ) \buildrel{\triangle}\over{=} ({\bf {\cal U}} \times_1 {\bf h}_{(1)}) \times_2 {\bf h}_{(2)} = {\bf y},$$(5)

where $\times_i$ denotes multiplying along the $i$ th dimension (mode) of the tensor [ ^{16}]. In more common notation:

$$y_k = {\bf h}_{(1)}^T {\bf U}_k {\bf h}_{(2)}, \quad k = 1, 2, \ldots, C,$$(6)

where ${\bf U}_k \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_1 \times L_2}$ denotes the (matrix) slice of ${\cal U}$ obtained by fixing the third index to $k$ and allowing the first two indices to vary.
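As a small numerical illustration, the mode-product form in (5) can be compared with the slice-wise form $y_k = {\bf h}_{(1)}^T {\bf U}_k {\bf h}_{(2)}$. The sketch below uses `np.tensordot` for the two mode products; the dimensions and random values are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
L1, L2, C = 3, 4, 2
U = rng.standard_normal((L1, L2, C))   # third-order weight tensor
h1 = rng.random(L1)                    # first hidden branch, values in [0, 1)
h2 = rng.random(L2)                    # second hidden branch

# Mode-1 product with h1 leaves an (L2 x C) array; contracting its first
# axis with h2 is then the mode-2 product of the original tensor.
y_modes = np.tensordot(np.tensordot(U, h1, axes=([0], [0])), h2, axes=([0], [0]))

# Slice-wise form: one bilinear form per output unit k.
y_slices = np.array([h1 @ U[:, :, k] @ h2 for k in range(C)])

print(np.allclose(y_modes, y_slices))
```

Fixing `h2` makes the map linear in `h1` and vice versa, which is the defining property of a bilinear map stated above.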

It is instructive to link the T-DSN's behavior to that of the DSN, which we can accomplish by manipulating the bilinear notation above. First, define ${\bf \tilde{h}}$ to be ${\bf \tilde{h}} = {\bf h}_{(1)} \otimes {\bf h}_{(2)} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_1 L_2}$ , where $\otimes$ denotes the Kronecker product: The $((i-1)L_2+j)$ th element of ${\bf h}_{(1)} \otimes {\bf h}_{(2)}$ is equal to $h_{(1)i} h_{(2)j}$ , for $i \in \{1, 2, \ldots, L_1 \}, j \in \{1, 2, \ldots, L_2 \}$ . By definition, ${\bf \tilde{h}}$ contains all pairs of products between elements in ${\bf h}_{(1)}$ and elements in ${\bf h}_{(2)}$ . One can then vectorize ${\bf U}_k$ to create ${\bf \tilde{u}}_k = \hbox{vec}({\bf U}_k) \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_1 L_2}$ using the ordering that matches ${\bf \tilde{h}}$ , that is, if the $\ell$ th element of ${\bf \tilde{h}}$ is $h_{(1)i} h_{(2)j}$ , then the $\ell$ th element of ${\bf \tilde{u}}_k$ is ${\cal U}_{ijk}$ . Then:

$$y_k = \sum_{i=1}^{L_1} \sum_{j=1}^{L_2}{\cal U}_{ijk} h_{(1)i} h_{(2)j} = {\bf \tilde{u}}_k^T {\bf \tilde{h}}.$$(7)

Arranging all ${\bf \tilde{u}}_k$ , $k=1, 2,\ldots, C$ , into a matrix ${\bf \tilde{U}} = [ {\bf \tilde{u}}_1 \; {\bf \tilde{u}}_2 \; \ldots \; {\bf \tilde{u}}_C ]$ , the overall prediction then becomes

$${\bf y} = {\bf \tilde{U}}^T {\bf \tilde{h}}.$$(8)

Thus, the bilinear mapping from the two hidden-layer branches can be viewed as a linear mapping from a single, implicit hidden representation, ${\bf \tilde{h}}$ . The linear mapping uses matrix ${\bf \tilde{U}}$ , which contains all of the elements of the tensor ${\bf {\cal U}}$ *unfolded* into a matrix. Aggregating the implicit hidden representations ${\bf \tilde{h}}$ for each of the $N$ training data points into the columns of an $L_1L_2\times N$ matrix ${\bf \tilde{H}}$ , we obtain

$${\bf Y} = {\bf \tilde{U}}^T {\bf \tilde{H}}.$$(9)

This leads to the same prediction equation as in the DSN, but with an implicit hidden representation ${\bf \tilde{h}}$ that contains pairwise multiplicative interactions between ${\bf h}_{(1)}$ and ${\bf h}_{(2)}$ , incorporating second-order statistics of the input data in a parsimonious manner. In Fig. 2, we present an equivalent architecture to the bottom block of Fig. 1, illustrating how the two hidden layers are expanded into an implicit hidden layer. The relationship between the matrices of explicit, lower dimensional hidden units, ${\bf H}_{(1)} = \sigma ( {\bf W}_{(1)}^T {\bf X} )$ and ${\bf H}_{(2)} = \sigma ( {\bf W}_{(2)}^T {\bf X} )$ , and the matrix of implicit hidden units, ${\bf \tilde{H}}$ , is

$${\bf \tilde{H}} = {\bf H}_{(1)} \odot {\bf H}_{(2)}.$$

The $\odot$ operation is the Khatri-Rao product [ ^{16}], which performs a column-wise Kronecker product.
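The equivalence between the bilinear form and the unfolded linear form can be checked numerically. The sketch below implements the Khatri-Rao product by broadcasting (the helper name `khatri_rao` and the toy sizes are choices made for this example) and unfolds the tensor with a row-major reshape, so that row $(i-1)L_2+j$ of the unfolded matrix (in the 1-based indexing of the text) holds ${\cal U}_{ij\cdot}$, matching the ordering of the Kronecker product.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (L1 x N), (L2 x N) -> (L1*L2 x N)."""
    L1, N = A.shape
    L2 = B.shape[0]
    # Entry (i*L2 + j, n) of the result is A[i, n] * B[j, n] (0-based).
    return (A[:, None, :] * B[None, :, :]).reshape(L1 * L2, N)

rng = np.random.default_rng(0)
L1, L2, C, N = 3, 4, 2, 5
H1 = rng.random((L1, N))
H2 = rng.random((L2, N))
U = rng.standard_normal((L1, L2, C))

H_tilde = khatri_rao(H1, H2)            # implicit hidden units, L1*L2 x N
U_tilde = U.reshape(L1 * L2, C)         # tensor unfolded into a matrix

Y_unfolded = U_tilde.T @ H_tilde                      # linear map on implicit layer
Y_bilinear = np.einsum('ijk,in,jn->kn', U, H1, H2)    # bilinear map per sample

print(np.allclose(Y_unfolded, Y_bilinear))
```

Each column of `H_tilde` equals `np.kron(H1[:, n], H2[:, n])`, which is the per-sample vector ${\bf \tilde{h}}$ defined earlier.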

Fig. 2. Equivalent architecture to the bottom block of Fig. 1, where the tensor is unfolded into a large matrix.

Because the architectures shown in Figs. 1 and 2 are equivalent, learning the second layer weights given the implicit hidden representation is the same least-squares problem encountered by the DSN. Specifically, the Tikhonov regularized optimization problem (with training target matrix ${\bf T}$ ),

$$\mathop{\rm min}_{{\bf \tilde{U}}^T} \Vert {\bf T} - {\bf \tilde{U}}^T {\bf \tilde{H}} \Vert^2_F + \lambda \Vert {\bf \tilde{U}} \Vert_F^2,$$(10)

has the closed-form solution of

$${\bf \tilde{U}}^T = {\bf T}{\bf \tilde{H}}^\ddagger = {\bf T}{\bf \tilde{H}}^T ({\bf \tilde{H}}{\bf \tilde{H}}^T + \lambda {\bf I})^{-1}.$$(11)
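A minimal sketch of the closed-form solution (11) follows. Rather than forming the inverse explicitly, it solves the symmetric linear system with `np.linalg.solve`, which is a common numerical preference and an implementation choice of this example rather than something prescribed by the text. The sketch also checks the first-order optimality condition of the regularized least-squares objective; the sizes, data, and value of $\lambda$ are arbitrary.

```python
import numpy as np

def upper_weights(H_tilde, T, lam):
    """Regularized least-squares solution U~^T = T H~^T (H~ H~^T + lam I)^{-1}."""
    G = H_tilde @ H_tilde.T + lam * np.eye(H_tilde.shape[0])
    # G is symmetric, so U~ = G^{-1} (H~ T^T); transpose to get U~^T.
    return np.linalg.solve(G, H_tilde @ T.T).T

rng = np.random.default_rng(0)
Lt, C, N = 6, 3, 40                      # Lt plays the role of L1*L2
H_tilde = rng.random((Lt, N))
T = rng.standard_normal((C, N))
lam = 0.1

Ut = upper_weights(H_tilde, T, lam)

# Same result via the explicit inverse written in (11).
Ut_explicit = T @ H_tilde.T @ np.linalg.inv(H_tilde @ H_tilde.T + lam * np.eye(Lt))

# Gradient of the objective with respect to U~^T should vanish at the solution.
stationarity = 2.0 * (Ut @ H_tilde - T) @ H_tilde.T + 2.0 * lam * Ut
print(np.max(np.abs(stationarity)))
```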

Substituting the constraint of (11) into the overall objective of (10), we can make the full learning more effective by coupling the lower weight matrices with the upper weight matrix. In other words, as with the DSN, the upper layer weights of a T-DSN block are deterministically computed given the lower layer weights and thus they do not need to be learned separately and independently. This contrasts with the standard neural network learning algorithms, where both sets of weights are updated incrementally with no direct constraints on each other. Such constraints are available for T-DSN due to the use of linear output layers in each block, and are not available for standard neural networks with nonlinear output layers.

We now address the more difficult learning problem for the two lower weight matrices ${\bf W}_{(1)}$ and ${\bf W}_{(2)}$ since ${\bf \tilde{H}}$ is a deterministic function of the lower layer weights. In this paper, we adopt the strategy of optimizing ${\bf W}_{(1)}$ and ${\bf W}_{(2)}$ using methods requiring only first-order oracle information, i.e., those requiring only objective function evaluations and the gradients of the objective function with respect to ${\bf W}_{(1)}$ and ${\bf W}_{(2)}$ . Note that this includes approximate second-order optimization methods such as L-BFGS. The derivation of these gradients has commonality to the DSN case, with an extra step necessitated by the Khatri-Rao product. To simplify the notation throughout the paper we assume $\lambda =0$ in (10) (a detailed derivation of the gradient for arbitrary $\lambda$ is provided in Appendix A).

With notation analogous to (4), let ${\bf \tilde{\Theta }}$ denote

$${\bf \tilde{\Theta }} = \nabla_{{\bf \tilde{H}}^T} f = 2 \big[ {\bf \tilde{H}}^\dagger ({\bf \tilde{H}T}^T)({\bf T\tilde{H}}^\dagger ) - {\bf T}^T({\bf T\tilde{H}}^\dagger ) \big].$$(12)

By the chain rule, we obtain

$$\big[ \nabla_{{\bf H}_{(1)}} f \big]_{in} = \big\langle {\bf \tilde{\Theta }}^T, {\bf E}_{(i,n)}^{L_1 \times N} \odot {\bf H}_{(2)} \big\rangle$$(13)

$$= \sum_{k=1}^{L_2} H_{(2)kn} \tilde{\Theta }_{n,((i-1)L_2+k)},$$(14)

$$\big[ \nabla_{{\bf H}_{(2)}} f \big]_{jn} = \big\langle {\bf \tilde{\Theta }}^T, {\bf H}_{(1)} \odot {\bf E}_{(j,n)}^{L_2 \times N} \big\rangle$$(15)

$$= \sum_{k=1}^{L_1} H_{(1)kn} \tilde{\Theta }_{n,((k-1)L_2+j)},$$(16)

where ${\bf E}_{(i,j)}^{m \times n}$ denotes an $m \times n$ matrix with entry $(i,j)$ equal to one and all other entries zero. Let ${\bf \Psi }_{(1)} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_1 \times N}$ and ${\bf \Psi }_{(2)} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_2 \times N}$ denote the matrices $\nabla_{{\bf H}_{(1)}} f$ and $\nabla_{{\bf H}_{(2)}} f$ , respectively. Then, following the derivation of the DSN, we obtain

$$\nabla_{{\bf W}_{(1)}} f = {\bf X}\big({\bf H}^T_{(1)} \circ \big({\bf 1} - {\bf H}^T_{(1)}\big) \circ {\bf \Psi }^T_{(1)}\big)$$

and

$$\nabla_{{\bf W}_{(2)}} f = {\bf X}\big({\bf H}^T_{(2)} \circ \big({\bf 1} - {\bf H}^T_{(2)}\big) \circ {\bf \Psi }^T_{(2)}\big).$$(17)

The ${\bf \Psi }_{(1)}$ and ${\bf \Psi }_{(2)}$ matrices above have the effect of bridging the high-dimensional representation used in ${\bf \tilde{\Theta }}$ with the low-dimensional representation in ${\bf H}_{(1)}$ and ${\bf H}_{(2)}$ , and are needed due to the Khatri-Rao product.
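The gradient computation for one T-DSN block (with $\lambda = 0$) can be sketched in NumPy as below. The bridging matrices ${\bf \Psi }_{(1)}$ and ${\bf \Psi }_{(2)}$ are obtained by reshaping ${\bf \tilde{\Theta }}^T$ into an $L_1 \times L_2 \times N$ array and contracting with the other branch, using the Kronecker ordering in which implicit unit $(i-1)L_2+j$ pairs $h_{(1)i}$ with $h_{(2)j}$. Function names, toy sizes, and random data are assumptions of this example; ${\bf \tilde{\Theta }}$ carries the factor 2 on both terms, and ${\bf \Psi }_{(j)}$ is transposed to match the $N \times L_j$ shape of ${\bf H}_{(j)}^T$. A finite-difference comparison is included as a sanity check.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def khatri_rao(A, B):
    L1, N = A.shape
    L2 = B.shape[0]
    return (A[:, None, :] * B[None, :, :]).reshape(L1 * L2, N)

def tdsn_block(W1, W2, X, T):
    """Objective and lower-layer gradients for one T-DSN block (lambda = 0)."""
    H1 = sigmoid(W1.T @ X)                         # L1 x N
    H2 = sigmoid(W2.T @ X)                         # L2 x N
    L1, N = H1.shape
    L2 = H2.shape[0]
    Ht = khatri_rao(H1, H2)                        # L1*L2 x N implicit units
    Ht_pinv = Ht.T @ np.linalg.inv(Ht @ Ht.T)      # N x L1*L2
    Ut = T @ Ht_pinv                               # closed-form upper weights
    f = np.sum((T - Ut @ Ht) ** 2)
    Theta = 2.0 * (Ht_pinv @ (Ht @ T.T) @ Ut - T.T @ Ut)   # N x L1*L2
    Th = Theta.T.reshape(L1, L2, N)                # Th[i, j, n] = df/dHt[i*L2+j, n]
    Psi1 = np.einsum('ijn,jn->in', Th, H2)         # gradient w.r.t. H1, L1 x N
    Psi2 = np.einsum('ijn,in->jn', Th, H1)         # gradient w.r.t. H2, L2 x N
    G1 = X @ (H1.T * (1.0 - H1.T) * Psi1.T)        # D x L1
    G2 = X @ (H2.T * (1.0 - H2.T) * Psi2.T)        # D x L2
    return f, G1, G2

rng = np.random.default_rng(0)
D, L1, L2, C, N = 5, 3, 2, 3, 60
X = rng.standard_normal((D, N))
T = rng.standard_normal((C, N))
W1 = rng.standard_normal((D, L1))
W2 = rng.standard_normal((D, L2))

f, G1, G2 = tdsn_block(W1, W2, X, T)

def fd_entry(which, i, j, eps=1e-5):
    """Central finite difference of f for entry (i, j) of W1 or W2."""
    E1 = np.zeros_like(W1)
    E2 = np.zeros_like(W2)
    (E1 if which == 1 else E2)[i, j] = eps
    fp = tdsn_block(W1 + E1, W2 + E2, X, T)[0]
    fm = tdsn_block(W1 - E1, W2 - E2, X, T)[0]
    return (fp - fm) / (2 * eps)

fd1 = fd_entry(1, 0, 1)
fd2 = fd_entry(2, 3, 0)
print(G1[0, 1], fd1, G2[3, 0], fd2)
```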

Using the above gradients one can train a T-DSN block using a number of algorithms; in our experiments, we use the L-BFGS and gradient descent implementations in the Poblano optimization toolbox [ ^{17}]. Our experience suggests that training a T-DSN block requires around 10-15 iterations of L-BFGS, with up to seven gradient evaluations per iteration for line search. In our experiments, the weight matrices ${\bf W}_{(1)}$ and ${\bf W}_{(2)}$ are initialized with random values in a range that is tuned using the validation set.

From (12) and (17), it is clear that the bulk of the gradient computation is in matrix operations, including matrix multiplications and element-wise matrix products. To bypass memory limitations and to speed up training, we parallelize these matrix operations to run on a CPU cluster. The ability to parallelize training in this manner is key for the scalability of T-DSN training.

We show here that the DSN is, in fact, a special case of the T-DSN. To distinguish this special case from the general case, we will use ${\bf \hat{H}}$ and ${\bf \hat{U}}$ to denote the T-DSN's ${\bf \tilde{H}}$ and ${\bf \tilde{U}}$ , respectively. As before, let ${\bf h}_{(1)}$ and ${\bf h}_{(2)}$ denote the two hidden representations in a T-DSN block and let ${\bf h}$ denote the (only) hidden representation in a DSN block. Let ${\bf W}_{(1)}$ , ${\bf W}_{(2)}$ , and ${\bf W}$ denote the corresponding first-layer weight matrices. A DSN is then a special case of a T-DSN when $L_1=1$ , $L_2=L$ , ${\bf W}_{(1)} = {\bf 0}$ , and ${\bf W}_{(2)} = {\bf W} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{D \times L}$ . Then ${\bf h}_{(1)} = \sigma ({\bf 0}^T {\bf x}) = 1/2$ and ${\bf h}_{(2)} = {\bf h}$ , and it follows that

$${\bf \hat{H}} = {\bf H}_{(1)} \odot {\bf H}_{(2)} = {1\over 2}{\bf H}.$$(18)

In this case, the pseudo-inverse is

$${\bf \hat{H}}^\dagger = {\bf \hat{H}}^T ({\bf \hat{H}}{\bf \hat{H}}^T)^{-1} = {1\over 2}{\bf H}^T \left({1\over 4}{\bf H}{\bf H}^T\right)^{-1} = 2{\bf H}^\dagger,$$(19)

giving the least-squares solution ${\bf \hat{U}}^T = {\bf T}{\bf \hat{H}}^\dagger = 2 {\bf T}{\bf H}^\dagger$ . Hence, the T-DSN and DSN predictions are identical:

$${\bf \hat{U}}^T {\bf \hat{H}} = \left( 2 {\bf U}^T \right) \left( {1\over 2}{\bf H} \right) = {\bf U}^T {\bf H} = {\bf Y}.$$(20)

In Appendix B, we show that under these same conditions, the gradients $\nabla_{{\bf W}_{(2)}}f$ and $\nabla_{\bf W} f$ are identical as well.
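The special-case argument in (18)-(20) can be reproduced numerically. In this sketch, with hypothetical toy sizes and random data, the first branch has a single unit with zero weights, so its activation is $1/2$ for every sample, and the resulting T-DSN predictions coincide with those of the DSN built from the second branch alone.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D, L, C, N = 5, 4, 3, 30
X = rng.standard_normal((D, N))
T = rng.standard_normal((C, N))
W = rng.standard_normal((D, L))

# DSN block: single hidden layer and closed-form upper weights.
H = sigmoid(W.T @ X)
U_t = T @ H.T @ np.linalg.inv(H @ H.T)

# Degenerate T-DSN block: L1 = 1 with zero weights, L2 = L with W2 = W.
W1 = np.zeros((D, 1))
H1 = sigmoid(W1.T @ X)                  # 1 x N, every entry equals 0.5
H2 = sigmoid(W.T @ X)
H_hat = (H1[:, None, :] * H2[None, :, :]).reshape(1 * L, N)   # Khatri-Rao product
U_hat_t = T @ H_hat.T @ np.linalg.inv(H_hat @ H_hat.T)

print(np.allclose(H_hat, 0.5 * H),
      np.allclose(U_hat_t, 2.0 * U_t),
      np.allclose(U_hat_t @ H_hat, U_t @ H))
```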

While the DSN is a special, extremely asymmetric, case of the T-DSN, we have found that the closer the two hidden-layer branches' dimensions are, the better the classification performance (see Section 5.2 for empirical evidence). In the nondegenerate cases ( $L_1 \approx L_2 \gg 1$ ), the T-DSN dramatically shifts the balance of parameters from the lower layer weights to the upper layer weights. Whereas the DSN has $D \times L$ lower layer (explicit) parameters and $L \times C$ upper layer (implicit) parameters, the T-DSN has $D \times (L_1 + L_2)$ lower layer parameters and $L_1 \times L_2 \times C$ upper layer parameters. We conjecture that the observed advantage of using equal numbers of hidden units arises because the symmetric case maximizes the ratio of implicit feature dimension $L_1 L_2$ over explicit feature dimension $L_1 + L_2$ , and thus makes the best use of the closed-form upper layer parameters. The key advantage of the nondegenerate T-DSN over the degenerate one (i.e., the DSN) is the new ability to capture higher order feature interactions via the cross product.
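A short arithmetic sketch makes the parameter balance concrete. For a fixed total number of explicit hidden units $L_1 + L_2$, the ratio $L_1 L_2 / (L_1 + L_2)$ is largest when the two branches are equal. The splits below, and the use of $D = 784$ and $C = 10$ for the parameter count, are illustrative choices and not configurations reported as results.

```python
def param_counts(D, L1, L2, C):
    """Lower layer (explicit) and upper layer (closed-form) parameter counts."""
    lower = D * (L1 + L2)
    upper = L1 * L2 * C
    return lower, upper

def implicit_to_explicit_ratio(L1, L2):
    """Implicit feature dimension L1*L2 over explicit dimension L1+L2."""
    return (L1 * L2) / (L1 + L2)

# Three ways to split 80 explicit hidden units between the two branches.
for L1, L2 in [(40, 40), (10, 70), (1, 79)]:
    print(L1, L2, implicit_to_explicit_ratio(L1, L2))
# ratios: 20.0, 8.75, 0.9875

print(param_counts(784, 40, 40, 10))   # (62720, 16000)
```

The last split, with $L_1 = 1$, corresponds to the degenerate DSN-like case, where the implicit layer is no larger than the explicit one.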

Stochastic minibatch training is commonly employed in deep learning. Empirically, researchers often observe diminishing returns in classification accuracy for gradient methods as the minibatch size increases. In contrast, due to the embedded least-squares problem, T-DSN's accuracy in classification tasks continues to improve as the minibatch size increases. For this reason, it is desirable to use the largest possible amount of training data at each iteration. In this section, we analyze the time and space complexities of our T-DSN training algorithm, and introduce a parallel training method that allows us to scale to large training sets.

To learn the T-DSN weights, we need to evaluate both the objective function and its gradients with respect to ${\bf W}_{(1)}$ and ${\bf W}_{(2)}$ at each iteration. Let $L = L_1 L_2$ denote the number of implicit hidden units. Then, computing the gradients defined by (17) involves a sequence of cached intermediate steps with the time complexities listed in Table 1, and has an overall space complexity of $O( (L+D+C)N )$ (due to the need to store ${\bf X}$ , ${\bf \tilde{H}}$ , and ${\bf \tilde{H}}^\dagger$ in memory). Evaluating the objective function in (10) has a time complexity of $O( NL(C+D))$ with the same space complexity. In practice, for large enough $N$ this space complexity will exceed the main memory of a single machine, making the ability to parallelize over many machines crucial for handling very large datasets.

Table 1. Gradient Computational Complexity, Assuming the Earlier Expressions Are Cached for Use in the Later Expressions

There are well-known techniques for parallel matrix multiplication of the form ${\bf A} = {\bf BC^T} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{q \times r}$ ; in general, they break ${\bf B} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{q \times N}$ and ${\bf C} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{r \times N}$ into submatrices that can be combined to produce ${\bf A}$ . Because our multiplies will involve instances where the common dimension of ${\bf B}$ and ${\bf C}^T$ (i.e., the number of training samples) is significantly larger than both $q$ and $r$ , we use the following basic matrix multiplication parallelization strategy:

$${\bf A} = \sum_{k=1}^P {\bf B}^{\langle k \rangle }{\bf C}^{\langle k \rangle T},$$(21)

where ${\bf B}^{\langle k \rangle }$ denotes the $k$ th subblock of matrix ${\bf B}$ that has been divided into $P$ subblocks along the second dimension:

$${\bf B} = [ {\bf B}^{\langle 1 \rangle } \;\; {\bf B}^{\langle 2 \rangle } \;\; \cdots \;\; {\bf B}^{\langle P \rangle } ].$$(22)

And likewise, we have for matrix ${\bf C}^T$ :

$${\bf C}^T = \left[ \matrix{{\bf C}^{\langle 1 \rangle T} \cr \vdots \cr {\bf C}^{\langle P \rangle T}} \right].$$(23)

Decomposing the large matrix operations into many small submatrix operations is the key to parallelizing the gradient and objective function computation.
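The decomposition in (21)-(23) can be sketched on a single machine as a sum of per-block products. In the sketch below the loop runs sequentially; in a cluster setting each term of the sum could be computed by a separate worker and the small $q \times r$ partial results added afterward. The function name `blocked_product` and the sizes are assumptions of this example.

```python
import numpy as np

def blocked_product(B, C, P):
    """Compute A = B C^T as a sum over P column blocks of B and C."""
    A = np.zeros((B.shape[0], C.shape[0]))
    B_blocks = np.array_split(B, P, axis=1)    # split along the sample dimension
    C_blocks = np.array_split(C, P, axis=1)
    for Bk, Ck in zip(B_blocks, C_blocks):
        A += Bk @ Ck.T                         # one independent q x r partial product
    return A

rng = np.random.default_rng(0)
q, r, N, P = 4, 3, 1000, 7
B = rng.standard_normal((q, N))
C = rng.standard_normal((r, N))

A_blocked = blocked_product(B, C, P)
A_direct = B @ C.T
print(np.allclose(A_blocked, A_direct))
```

Because $N$ is much larger than $q$ and $r$, each worker handles a large slice of the data but returns only a small matrix, which keeps the amount of data exchanged low.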

Given current values for the T-DSN parameters ${\bf W}^{\prime }_{(1)}$ and ${\bf W}^{\prime }_{(2)}$ , we use the above parallelization strategy to compute the objective function value and gradients. Our parallelization is over training data points: We split any matrix ${\bf M}$ with a second dimension $N$ into $P$ submatrices ${\bf M}^{\langle k \rangle }$ , each of which has $N_k$ columns.

The computation pipeline is broken into a large number of jobs, as illustrated by the directed acyclic graph in Fig. 3. The arrows denote dependence: A job can run once all of the jobs feeding into it have completed. The results of each job in Fig. 3 are cached and used in subsequent processing. There are three qualitatively different kinds of jobs, denoted by the different shapes in the figure. The green three-dimensional boxes each denote a set of $P$ jobs, where the individual jobs process a fraction of the total dataset (the $k$ th batch). The orange rectangle jobs are accumulators, and only need to sum over the $P$ individual files on which they depend. Because they begin accumulating the sum of their ancestors as each one finishes, they introduce minimal delay into the pipeline. The jobs marked with red octagons cannot be run until all ancestor jobs have been finished since they require synchronizing over the $P$ batches.

Fig. 3. T-DSN parallel pipeline. The three-dimensional boxes denote sets of $P$ parallel jobs, rectangles denote accumulator jobs and octagons denote sequential jobs. All other values from Table 2 are recomputed as needed.

Table 2 lists the full set of intermediate variables that must be computed to construct the gradient and evaluate the function; it is a superset of the variables listed in Fig. 3. Note that no variable that is dependent on the dataset size (i.e., has a dimension equal to $N_k$ ) is cached. These variables are recomputed according to their definitions each time as the cost of caching these variables to the disk was found to exceed the cost of recomputing them.

Table 2. The Variables Computed in the Parallel Pipeline, Their Dimension, and the Mathematical Operations Required to Produce Them

At the conclusion of the parallel pipeline, the variable $s$ holds the objective function value $f({\bf W}^{\prime }_{(1)},{\bf W}^{\prime }_{(2)})$ , and the variable ${\bf G}$ holds the concatenation of the gradients $\nabla_{{\bf W}_{(i)}} f$ , $i = 1, 2$ .
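The idea of recomputing dataset-sized variables per batch while keeping only small accumulated sums can be illustrated for the objective value $s$ (with $\lambda = 0$). The sketch below is a single-process stand-in for the batch and accumulator jobs: the accumulator names (`HHt`, `THt`, `tt`) and the function `objective_by_batches` are hypothetical and are not taken from Table 2. It relies on the identity $f = \Vert {\bf T} \Vert_F^2 - {\rm tr}\big({\bf \tilde{U}}^T {\bf \tilde{H}} {\bf T}^T\big)$, which holds at the least-squares solution.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def khatri_rao(A, B):
    L1, N = A.shape
    L2 = B.shape[0]
    return (A[:, None, :] * B[None, :, :]).reshape(L1 * L2, N)

def objective_by_batches(W1, W2, X, T, P):
    """Objective value from per-batch sums; no N-sized matrix is kept."""
    L = W1.shape[1] * W2.shape[1]
    C = T.shape[0]
    HHt = np.zeros((L, L))      # accumulates H~ H~^T
    THt = np.zeros((C, L))      # accumulates T H~^T
    tt = 0.0                    # accumulates ||T||_F^2
    X_batches = np.array_split(X, P, axis=1)
    T_batches = np.array_split(T, P, axis=1)
    for Xk, Tk in zip(X_batches, T_batches):
        # Implicit hidden units for this batch are recomputed, not cached.
        Hk = khatri_rao(sigmoid(W1.T @ Xk), sigmoid(W2.T @ Xk))
        HHt += Hk @ Hk.T
        THt += Tk @ Hk.T
        tt += np.sum(Tk ** 2)
    Ut = np.linalg.solve(HHt, THt.T).T          # closed-form upper weights
    s = tt - np.sum(Ut * THt)                   # objective at the optimum
    return s, Ut

rng = np.random.default_rng(0)
D, L1, L2, C, N, P = 5, 3, 2, 3, 120, 4
X = rng.standard_normal((D, N))
T = rng.standard_normal((C, N))
W1 = rng.standard_normal((D, L1))
W2 = rng.standard_normal((D, L2))

s, Ut = objective_by_batches(W1, W2, X, T, P)

# Reference value computed with the full implicit hidden matrix in memory.
H_full = khatri_rao(sigmoid(W1.T @ X), sigmoid(W2.T @ X))
s_direct = np.sum((T - Ut @ H_full) ** 2)
print(s, s_direct)
```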

Our original and primary motivation for the parallel implementation was to allow us to scale beyond the memory limit of a single machine. We also measure the effect of parallelization on speed. As usual, there is a cost associated with parallelization, namely, the interprocess communication time. Because our implementation uses network disk to store and load cached variables, this cost is nontrivial. Fig. 4 shows empirical wall-clock runtimes over repeated single instances of the parallel pipeline (i.e., each computing the gradient and evaluating the objective) on the TIMIT data (1.12M training samples). Specifically, the mean and the upper and lower 95 percent confidence intervals are plotted. To produce the timing results in this figure, we use hidden representations of dimension $L_1=L_2=40$ , and run a single instance of the parallel pipeline eight times for each number of machines, $P$ , across which the parallel training is distributed. On each machine, processing is parallelized over eight cores. A fixed, additional overhead associated with initializing the data is also included in the presented times; in practice, these overhead costs would become negligible over the course of training a multiple-block T-DSN. The minimum value for $P$ is four since lower values caused the compute nodes' memory to be exceeded. For this dataset, the lowest average times are achieved in the range between $P=10$ and $P=25$ (80-200 cores). After this, there is a gradual rise in the total computation times as improvements in computation time are outpaced by the additional disk access and communication costs. In practice, because the speedup is relatively insensitive to the degree of parallelization, we simply set $P$ to be sufficiently large that training does not exceed the compute nodes' memory limits. Note that the stacking nature of the T-DSN means that one cannot parallelize over blocks, only within blocks. Because, in practice, the number of blocks is limited and the bulk of the computation is spent within blocks, this does not prove to be a major obstacle to training.

Fig. 4. Empirical wall-clock timings by the number of cores for completing the parallel pipeline, repeating eight runs with $L_1=L_2=40$ for each degree of parallelization measured. Upper and lower 95 percent confidence intervals are plotted in blue; average time is dashed white.

In this section, we detail the experiments and present the results aimed at evaluating the effectiveness of the T-DSN architecture described in the preceding sections. Three well-known image and speech databases for benchmarking are used: MNIST, TIMIT, and WSJ.

In the first set of experiments, we evaluate the T-DSN architecture and the learning algorithms on the MNIST database of binary images of handwritten digits [ ^{18}]. The digits have been size-normalized while preserving their aspect ratio. Each original image is centered by computing and translating the center of mass of the pixels, yielding a $28 \times 28$ image. The task is to classify each $28 \times 28$ image into one of the 10 digits. The MNIST training set is composed of 60,000 examples from approximately 250 writers. The test set is composed of 10,000 patterns. The sets of writers of the training set and test set are disjoint. In the experiments, a small fraction of the training data is held out as a validation set to tune hyperparameters in the T-DSN. The properties of the validation and test sets in MNIST are found to be very similar to each other.

The architecture of the T-DSN used in the MNIST experiment is the one shown in Fig. 1, except with four stacking blocks instead of three. The input layer at the bottom block consists of 784 units, one for each black-white pixel in the $28 \times 28$ image. The two hidden layers consist of $L_1$ and $L_2$ sigmoidal units, respectively, and we denote this hidden-layer configuration by $L_1 \times L_2$ . All blocks have the same hidden-layer configuration in all experiments reported in this section. The prediction layer of all blocks has 10 linear units, corresponding to the 10 digit output classes. The input layer at the nonbottom blocks is a concatenation of the raw input data and the prediction layer's output from the previous block, thus having a dimensionality of 794.
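The input construction for the nonbottom blocks can be illustrated with a minimal sketch. It assumes flattened pixel vectors stored as plain Python lists; the function name `next_block_input` and the placeholder values are hypothetical and are not taken from the original implementation.

```python
def next_block_input(raw_input, prev_prediction):
    """Concatenate the raw input with the previous block's prediction output."""
    return list(raw_input) + list(prev_prediction)

# Placeholder data: one flattened 28 x 28 image and the 10 linear outputs
# of the previous block (values are made up for illustration).
raw_pixels = [0.0] * (28 * 28)
prev_output = [0.1] * 10

block_input = next_block_input(raw_pixels, prev_output)
input_dim = len(block_input)  # 784 + 10 = 794
```

The raw input is passed unchanged to every block, so only the 10 appended values differ from one block's input to the next.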

Fig. 5 shows the training objective (mean square error) between the prediction layer's output and the zero-one target averaged over the full training set as a function of each T-DSN block and also as a function of epochs in the batch-mode gradient-descent training for each T-DSN block. The corresponding test-set classification error rate is shown in Fig. 6. The hidden-layer configuration of the T-DSN used here is $L_1 \times L_2 =90 \times 90$ . The two sets of input matrix weights to the two hidden layers in each T-DSN block are initialized with small uniform random numbers for parallel gradient-descent training.

Fig. 5. MNIST: Training objective (mean square error) at each of the training epochs for each block of the T-DSN with hidden-layer configuration of $90 \times 90$ .

Fig. 6. MNIST: Test-set error rate as a function of training epochs at each block of the T-DSN with hidden-layer configuration of $90 \times 90$ .

The results in Figs. 5 and 6 show that when a new block is added to the T-DSN, both the training objective function and the test error rate are reduced even if the training in the previous block already reached near convergence. Further, we observe in Fig. 5 that as more blocks are added, the training objective continues to drop to close to zero and, although not shown in the figure, the error rate for the training set drops to zero as well. There is no obvious rise in errors observed for the test set (see Fig. 6).

In Table 3, we show the relationship between the test error rate and the hidden-layer configuration. The error rate is obtained at the convergence of the training with the optimal number of blocks and several other hyperparameters (learning rates, size of the initial random weights, and so on) determined on the validation set. We observe that with the same number of implicit hidden units, the symmetric configuration is significantly better than nonsymmetric configurations. Also, the hidden layers should be sufficiently large to produce a low error rate; the usable size is possibly limited by the amount of training data and can be determined on the validation set.

Table 3. MNIST: Test-Set Error Rate at Convergence as a Function of Hidden-Layer Configuration

As an extreme case, when one of the two hidden layers in each block reduces to a single unit, the corresponding T-DSN behaves like a DSN with a significantly higher error rate—e.g., $30 \times 30$ versus $900 \times 1$ —as shown in Table 3.

The MNIST website provides the results of 68 classifiers. A very large and deep convolutional neural network gives the state-of-the-art error rate of 0.39 percent [ ^{19}]. The use of distortions to augment the training data is important to achieve this lowest error rate. Without the use of the distortions, which is impractical in real applications, the error rate was increased to 0.53 percent. Without the use of convolutional structure and distortions or any other types of special preprocessing, the lowest error rate, 0.83 percent, was reported in [ ^{3}] by a carefully tuned and optimized DSN. The error rate of 1.21 percent reported in this paper is comparable to that achieved using the deep belief network as reported in [ ^{9}]. It was obtained without careful tuning, without passing of the learned weights from one block to another, and without pretraining of weights using restricted Boltzmann machines, which were all exploited in [ ^{3}].

In the second set of experiments, the TIMIT database is used for evaluating the T-DSN. The training set consists of 462 speakers. The total number of frames in the training data is 1,124,589. The validation set provided by the database contains 50 speakers, with a total of 122,488 frames. Results are reported using the standard 24-speaker core test set consisting of 192 sentences with 7,333 phone tokens and 57,920 frames.

The speech data is analyzed using standard Mel frequency cepstral coefficients (MFCCs). All experiments used a context window of 11 frames. This gives a total of $39\cdot 11=429$ elements in each feature vector as the raw input to T-DSN. This window size was shown to be optimal for the TIMIT phone recognition task in different kinds of deep networks published earlier (e.g., [ ^{20}]) and has not been customized for the T-DSN in this study. For the prediction at each layer of the T-DSN, we use 183 target class labels (i.e., three states for each of the 61 phones), which we call “phone states,” with a zero-one (also called one-hot) encoding scheme. Phone boundaries are labeled in the corpus; we obtain the phone state labels by a state-frame alignment using a strong GMM-HMM system, as is common for recent deep learning work for speech recognition.
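The input and target construction described above can be sketched as follows. This is only an illustration under stated assumptions: frames are 39-dimensional lists, utterance edges are padded by repeating the first or last frame (the text does not specify the edge handling), and the helper names `stack_context` and `one_hot` are hypothetical.

```python
def stack_context(frames, window=11):
    """Concatenate each frame with its neighbours inside a symmetric window.

    Edge frames are padded by repeating the first/last frame, which is an
    assumption made for this sketch.
    """
    half = window // 2
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        vec = []
        for offset in range(-half, half + 1):
            idx = min(max(t + offset, 0), last)
            vec.extend(frames[idx])
        stacked.append(vec)
    return stacked

def one_hot(label, num_classes=183):
    """Zero-one target vector for one of the 183 phone states."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

# Made-up utterance of 20 frames with 39 coefficients each.
frames = [[float(t)] * 39 for t in range(20)]
features = stack_context(frames)
feature_dim = len(features[0])  # 39 * 11 = 429
target = one_hot(5)
```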

The results reported in this section are obtained using the main T-DSN architecture illustrated in Fig. 1, where the number of stacking blocks is between 8 and 13 as determined on the validation set. In some experiments, one or more additional hidden layers and a softmax layer are added to the top of the T-DSN for computing frame-level state posterior probabilities. This latter step is needed for the phone recognition task when a further dynamic programming step is used to reach the phone recognition decision. Only symmetric hidden-layer configurations are used, and we tune configurations between $L_1 \times L_2 = 70 \times 70$ and $100 \times 100$ .

In Tables 4a and 4b, we compare the frame-level state classification error rates (F-SER) with (b) and without (a) using a trained softmax layer on top of the T-DSN. Obtaining the F-SER results requires no additional postprocessing, making the TIMIT experiment as simple as MNIST. Comparing (a) and (b), we observe a noticeable error reduction after the softmax layer is added. In each case, we also compare the T-DSN and its corresponding DSN. Similarly to the MNIST experiments, the T-DSN gives lower errors, especially in the case of softmax output (b). With the softmax output, we can also evaluate using the cross entropy (CE) measure for the test set. Cross entropy is the average of the negative log (base- $e$ ) posterior probabilities over all frames in the test set, computed from the softmax layer. The lower the cross entropy, the better the performance. In Table 4b, we further compare the T-DSN with two versions [ ^{20}], [ ^{21}] of DNN, and show that the T-DSN and DSN are both superior to DNN in both error rate and cross entropy.

Table 4. TIMIT: Comparing T-DSN (and DSN) (a) Before and (b) After Adding Softmax Layers to Produce Posterior Probabilities, in Terms of Frame-Level State Error Rate (F-SER) and CE Value
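The two measures reported in Table 4 can be sketched in a few lines. This is a minimal illustration using made-up scores, not the evaluation code behind the reported numbers; the function names are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores to posterior probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_cross_entropy(posteriors, labels):
    """Average negative natural-log posterior of the correct class over frames."""
    return -sum(math.log(p[y]) for p, y in zip(posteriors, labels)) / len(labels)

def frame_error_rate(posteriors, labels):
    """Fraction of frames whose most probable class differs from the label."""
    wrong = 0
    for p, y in zip(posteriors, labels):
        predicted = max(range(len(p)), key=lambda k: p[k])
        wrong += predicted != y
    return wrong / len(labels)

# Toy example: three frames, three classes (scores are made up).
posteriors = [softmax([2.0, 0.0, 0.0]),
              softmax([0.0, 1.0, 0.0]),
              softmax([0.0, 3.0, 1.0])]
labels = [0, 1, 2]
ce = mean_cross_entropy(posteriors, labels)
fer = frame_error_rate(posteriors, labels)
```

A classifier that outputs a uniform posterior over $K$ classes has a cross entropy of $\log K$ , which gives a simple reference point when reading CE values.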

In Table 5, we now present a new set of TIMIT results, obtained after postprocessing the T-DSN outputs, which are more meaningful to speech researchers. The first measure is the framewise phone error rate, computed by 1) collapsing three sequential units (states) associated with each phone into one single phone class using majority voting over all frames within the phone boundaries in each test sentence as provided by the TIMIT database; and 2) collapsing a total of 183 output units into 39 phone-like units [ ^{22}]. Using this new measure after collapsing, we observe a lower error rate using the T-DSN than the DSN and two versions of DNN. When the outputs of the T-DSN are fed further to a five-hidden-layer DNN, denoted as “T-DSN + DNN” in Table 5, the framewise phone error rate drops further.

Table 5. TIMIT: Comparing Two Versions of T-DSN, DSN, and DNN in Terms of Frame-Level Phone Error Rate and of Continuous Phone Recognition Error Rate
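The collapsing and majority-voting steps behind the framewise phone error rate can be sketched as below. Several details are assumptions made only for illustration: states are numbered so that states $3p$ , $3p+1$ , and $3p+2$ belong to phone $p$ ; the vote is taken over the folded phone classes; every frame of a segment is scored with the segment's voted class; and `fold_map` is a toy stand-in for the real folding table of [ ^{22}], which is not reproduced here.

```python
from collections import Counter

def state_to_phone(state, states_per_phone=3):
    # Assumes states 3p, 3p+1, 3p+2 belong to phone p (illustrative ordering).
    return state // states_per_phone

def segment_phone(frame_states, fold_map):
    """Majority vote over the frames of one phone segment."""
    votes = Counter(fold_map[state_to_phone(s)] for s in frame_states)
    return votes.most_common(1)[0][0]

def framewise_phone_error(segments, fold_map):
    """segments: list of (predicted frame states, reference phone class)."""
    wrong = 0
    total = 0
    for frame_states, reference in segments:
        predicted = segment_phone(frame_states, fold_map)
        total += len(frame_states)
        if predicted != reference:
            wrong += len(frame_states)
    return wrong / total

# Toy folding table: three phones folded into two classes (hypothetical).
fold_map = {0: 0, 1: 0, 2: 1}
segments = [([0, 1, 2, 3], 0),
            ([6, 7, 3, 6, 8], 1),
            ([3, 4, 0], 1)]
error_rate = framewise_phone_error(segments, fold_map)
```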

The second measure (shown in the last column of Table 5) is continuous phonetic recognition error rate. This is computed by 1) collapsing a total of 183 T-DSN output units into 39 phone-like units, 2) normalizing the softmax outputs of T-DSN by state priors so that the posterior probabilities of states are converted to a quantity proportional to data likelihoods for each state, and 3) using a dynamic programming step across full sentences to determine phone recognition errors (substitution, deletion, and insertion errors). This last step is also called “phonetic decoding,” where a standard bigram phone-level “language” model is used with the language model weight and insertion penalty tuned using the validation data. The results in Table 5 using this measure again demonstrate superior performance of T-DSN, especially when the outputs of T-DSN are further processed by a DNN.
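The prior normalization in step 2 follows from Bayes' rule: $p(x \mid s) \propto p(s \mid x) / p(s)$ for a fixed frame $x$ . A minimal sketch is given below; estimating the priors as relative state frequencies in the training alignment is an assumption of this sketch, as are the function names and the toy numbers.

```python
import math
from collections import Counter

def estimate_priors(train_labels, num_states):
    """Relative frequency of each state in the training alignment.

    Assumes every state occurs at least once, so that no prior is zero.
    """
    counts = Counter(train_labels)
    return [counts[s] / len(train_labels) for s in range(num_states)]

def scaled_log_likelihoods(posteriors, priors):
    """log p(s|x) - log p(s), equal to log p(x|s) up to a per-frame constant."""
    return [math.log(p) - math.log(q) for p, q in zip(posteriors, priors)]

# Made-up alignment with a dominant state 0.
train_labels = [0] * 8 + [1] * 1 + [2] * 1
priors = estimate_priors(train_labels, 3)

frame_posterior = [0.7, 0.2, 0.1]
scores = scaled_log_likelihoods(frame_posterior, priors)
best_state = max(range(3), key=lambda s: scores[s])
```

In this toy case the posterior favors state 0, but after dividing by the priors state 1 has the highest scaled likelihood, which is the quantity the dynamic programming step consumes.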

The original motivation of this work was to make the learning of deep networks scalable by replacing the stochastic gradient descent algorithm for fine tuning with parallelizable batch-mode learning. As both the DSN and T-DSN do “fine tuning” only within the block rather than through the entire deep network as carried out for DNN, we expected at best a matching performance to DNN. While the DSN alone has not matched the low phonetic recognition error rate achieved by DNN, the T-DSN produces a slightly lower error rate (22.8 percent versus 22.9 percent). Further, for the measures of frame-level error rates and cross entropy, both the DSN and T-DSN outperform DNN and even MMI-DNN [ ^{20}]. The state-of-the-art TIMIT phone recognition error rate is 20.1 percent, reported and analyzed very recently in [ ^{23}], lower than the 22.8 percent reported here. The differences are due to 1) the use of a specially designed convolutional structure, 2) the use of filter-bank instead of MFCC features, and 3) very expensive optimization using backpropagation. The first two of these differences can be incorporated into the T-DSN architecture (as future work).

For pure classification problems such as MNIST and frame-level phone classification, we have found little difference between mean square error and cross entropy as the loss. However, for continuous phone recognition requiring the use of an HMM to interface with the frame-level classifier, the output needs to be in the form of probabilities. In our experiments reported in the right column of Table 5, we use cross entropy for learning, which is substantially more expensive in computation but obtains good results.

In the third set of experiments, we use another popular but larger speech database, called 5k-WSJ0, designed for speaker independent speech recognition tasks [ ^{25}]. As suggested by the name, the 5k-WSJ0 database uses a 5,000 word vocabulary. The training material from the SI84 set (7,077 utterances, or 15.3 hours of speech from 84 speakers) in the database is separated into a 6,877-utterance training set and a 200-sentence validation set. Evaluation was carried out on the Nov92 evaluation data with 330 utterances from eight speakers. In this paper, we use the same MFCCs and their deltas as in the TIMIT experiments for the short-time spectral representation of the speech signal. With the 10 millisecond frame rate, this database gives over 5 million frames (i.e., samples) in the training data (5,232,244 to be exact), substantially more than MNIST and TIMIT. Further, unlike the TIMIT database where the phone boundaries in the training data are provided by human annotators, no phone boundaries are given in WSJ0. In this paper, we generate the phone labels and their boundaries in the training data from the forced alignments using a tied-state crossword triphone Gaussian-mixture-HMM speech recognizer. Test-set labels are produced in the same way. These phone labels, 40 in total, together with their boundaries provide a one-to-one mapping between each speech frame and its phone label as the target for training the T-DSN.

In Table 6, we show the performance of a single block of the T-DSN, measured by the frame-level phone classification error rate. In the results presented in Table 6, the two weight matrices ${\bf W}_{(1)}$ and ${\bf W}_{(2)}$ in Fig. 1 are randomized and not learned via gradient descent. Learning is applied only to the tensor ${\bf U}$ according to (11). Consistent with the MNIST and TIMIT results, we also observe here that larger (and symmetric) hidden layers are better than smaller ones in the T-DSN. Further, a window size of 11 gives a noticeably lower error rate than 7, which in turn gives a lower error rate than a single frame.

Table 6. WSJ: Frame-Level Phone Classification Error Rate Achieved with Only One Block and with Random Weight Matrices ${\bf W}_{(1)}$ and ${\bf W}_{(2)}$ in Fig. 1, as a Function of Hidden-Layer Configuration and of the Input Window Size

In Table 7, frame-level phone classification error rates are shown after five blocks of the T-DSN are built, where each block runs gradient-descent learning until convergence with details described in Sections 3 and 4. No softmax layer is added to the top of the T-DSN. Again, at the learning convergence, the T-DSNs with larger hidden layer sizes and larger input window sizes are superior to those using smaller ones. Importantly, each entry in Table 7 shows a significantly lower error rate than the corresponding entry in Table 6, demonstrating the effectiveness of building a deep T-DSN and of the learning algorithms with the parallel implementation described in Sections 3 and 4.

Table 7. WSJ: Frame-Level Phone Classification Error Rate, After Stacking Five Blocks and Training Weight Matrices ${\bf W}_{(1)}$ and ${\bf W}_{(2)}$ for Each Block, as a Function of Hidden-Layer Configuration and of the Input Window Size

A new architecture for deep learning is presented, the T-DSN, generalizing the earlier DSN architecture. The principal novelty is to split the original large hidden layer (in each block) into two smaller ones and, through their multiplicative outer product and the associated tensor weights, to create a bilinear model exploiting the higher order covariance structure in binary ( $[0,1]$ ) hidden feature interactions.

The T-DSN retains the computational advantage of the DSN in parallelism and scalability when learning all parameters, including the second layer tensor and the first layer projection weight matrices. Note that the parallelism in learning the T-DSN can be implemented either in a CPU cluster (as carried out in the current study) or in a GPU cluster. The speedup of a single GPU over a CPU can be between $10\times$ and $100\times$ , but CPU programming is easier and CPUs are much more affordable than GPUs. All the experimental results presented in this paper have been obtained by the parallel implementation of the learning algorithm described in Section 4, using a cluster of CPUs exclusively.

In addition to the above main strengths, the T-DSN has another advantage over the earlier DSN architecture (an advantage shared by the deep tensor neural network [ ^{26}] over the DNN) in its potential to explicitly incorporate speaker and/or environmental factors by training one of the hidden representations to encode speaker or environmental information while effectively gating the other hidden-to-output mapping. Moreover, the T-DSN is equipped with the new stacking mechanism where the more compact dual hidden representations can be concatenated with the input data in stacking the T-DSN blocks. The significantly smaller hidden representation size in the T-DSN than DSN has the effect of bottle-necking the data, aiding “stackability” in the deep architecture by providing flexibility in stacking choices. One can concatenate the raw input data ${\bf x}$ with ${\bf h}_{(1)}$ and ${\bf h}_{(2)}$ instead of the output ${\bf y}$ , which may potentially be very large in some applications. The bottle-necking effect would permit the T-DSN to pass more information between the blocks without dramatically increasing the input dimension in the higher level blocks.

With the parallelized implementation of T-DSN already in place, we expect meaningful improvements in real-world speech recognition and other pattern recognition tasks. Further, encouraged by our recent results of DNN and DSN in applications of speech understanding [ ^{27}] and in speech attribute detection [ ^{28}], we expect greater success with the use of T-DSN in these and other applications.

A#### T-DSN Gradient Derivation

In this appendix, we derive the gradients used for training our lower level weight matrices, ${\bf W}_{(1)}$ and ${\bf W}_{(2)}$ , under the most general conditions.

A.1 Finding the Optimal ${\bf U}^T$

Given a fixed implicit hidden representation matrix ${\bf \tilde{H}}$ , consider the Tikhonov regularized least-squares objective

$$f = \Vert {\bf U}^T {\bf \tilde{H}} - {\bf T} \Vert_F^2 + \lambda \Vert {\bf U} \Vert_F^2.$$(24)

The well-known closed-form solution to this problem is

$${\bf U}^T = {\bf T}{\bf \tilde{H}}^\ddagger, \quad {\bf \tilde{H}}^\ddagger = {\bf \tilde{H}}^T {\bf A}^{-1}, \quad {\bf A} = {\bf \tilde{H}}{\bf \tilde{H}}^T + \lambda {\bf I}.$$(25)

We reserve the notation ${\bf \tilde{H}}^\dagger$ for the pseudo-inverse of ${\bf \tilde{H}}$ , which is clearly equal to ${\bf \tilde{H}}^\ddagger$ when $\lambda = 0$ .

A.2 Deriving $\nabla_{{\bf \tilde{H}}^T} f$

We can substitute the closed form solution for ${\bf U}^T$ given in (25) into the objective function. Our ultimate goal is to express this as a function of ${\bf W}_{(1)}$ and ${\bf W}_{(2)}$ , and compute the gradients with respect to these lower level weight matrices. As an intermediate step, we first compute the gradient of the objective with respect to ${\bf \tilde{H}}^T$ .

A.2.1 General Gradient

$$\eqalign{f &= \Vert {\bf U}^T {\bf \tilde{H}} - {\bf T} \Vert_F^2 + \lambda \Vert {\bf U} \Vert_F^2 \cr & ={\rm Tr} (({\bf U}^T {\bf \tilde{H}} - {\bf T})( {\bf U}^T {\bf \tilde{H}} - {\bf T})^T )+\lambda {\rm Tr} ({\bf U}{\bf U}^T ).}$$(26)

Denote the first term of (26) by $\alpha ({\bf \tilde{H}})$ and the second term by $\beta ({\bf \tilde{H}})$ . First, we derive $\nabla_{{\bf \tilde{H}}^T} \alpha ({\bf \tilde{H}})$ . By the linearity of the trace, and because ${\rm Tr}({\bf T}{\bf T}^T)$ does not depend on ${\bf \tilde{H}}$ , we obtain

$$\eqalign{\nabla_{{\bf \tilde{H}}^T} \alpha ({\bf \tilde{H}}) & = \nabla_{{\bf \tilde{H}}^T}{\rm Tr} ({\bf T}{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}{\bf \tilde{H}}^T {\bf \tilde{H}}^{\ddagger T}{\bf T}^T ) \cr & \quad- \; 2 \nabla_{{\bf \tilde{H}}^T}{\rm Tr} ({\bf T}{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}{\bf T}^T).}$$(27)

Before we can evaluate $\nabla_{{\bf \tilde{H}}^T} f$ , let us introduce five lemmas.

**Lemma A.1.**

*Let ${\bf A}$ denote ${\bf \tilde{H}}{\bf \tilde{H}}^T + \lambda {\bf I}$ (as in (25)), and let ${\bf Z}$ denote an arbitrary real $N \times N$ matrix. Then,*

$$\nabla_{{\bf \tilde{H}}^T}{\rm Tr} ({\bf A}^{-1}{\bf \tilde{H}}{\bf Z}{\bf \tilde{H}}^T ) = ( {\bf I} - {\bf \tilde{H}}^\ddagger {\bf \tilde{H}} ) ({\bf Z} + {\bf Z}^T){\bf \tilde{H}}^\ddagger.$$(28)

*When ${\bf Z}$ is symmetric, this simplifies to*

$$\nabla_{{\bf \tilde{H}}^T}{\rm Tr} \big({\bf A}^{-1}{\bf \tilde{H}}{\bf Z}{\bf \tilde{H}}^T \big) = 2 ( {\bf I} - {\bf \tilde{H}}^\ddagger {\bf \tilde{H}} ) {\bf Z}{\bf \tilde{H}}^\ddagger.$$(29)

Proof. Let ${\bf P}$ denote the constant matrix that is the result of evaluating ${\bf A}^{-1}$ with a fixed ${\bf \tilde{H}}$ . Let ${\bf Q}$ denote the constant matrix that is the result of evaluating ${\bf \tilde{H}}{\bf Z}{\bf \tilde{H}}^T$ with a fixed ${\bf \tilde{H}}$ . It follows that

$$\eqalign{& \nabla_{{\bf \tilde{H}}^T}{\rm Tr} \big( {\bf A}^{-1}{\bf \tilde{H}}{\bf Z}{\bf \tilde{H}}^T \big) \cr & \quad= \nabla_{{\bf \tilde{H}}^T}{\rm Tr} \big({\bf P}{\bf \tilde{H}}{\bf Z}{\bf \tilde{H}}^T \big) + \nabla_{{\bf \tilde{H}}^T}{\rm Tr} \left( {\bf A}^{-1}{\bf Q} \right).}$$(30)

Using (107) from [ ^{29}], and noting ${\bf A}$ 's symmetry, we can evaluate the first term in (30):

$$\nabla_{{\bf \tilde{H}}^T}{\rm Tr} \big({\bf P}{\bf \tilde{H}}{\bf Z}{\bf \tilde{H}}^T \big) = ({\bf Z} + {\bf Z}^T){\bf \tilde{H}}^T {\bf P} = ({\bf Z} + {\bf Z}^T){\bf \tilde{H}}^\ddagger.$$(31)

Using (114) from [ ^{29}] we can evaluate the second term in (30):

$$\nabla_{{\bf \tilde{H}}^T}{\rm Tr} \left( {\bf A}^{-1}{\bf Q} \right) = - {\bf \tilde{H}}^\ddagger ( {\bf Q} + {\bf Q}^T ) {\bf A}^{-1} = - {\bf \tilde{H}}^\ddagger {\bf \tilde{H}} ({\bf Z} + {\bf Z}^T){\bf \tilde{H}}^\ddagger.$$(32)

Substituting (31) and (32) into (30) completes the first part of the lemma. The second part is trivial: For symmetric ${\bf Z}$ we have ${\bf Z} + {\bf Z}^T = 2{\bf Z}$ .

**Lemma A.2.**

$$\eqalign{&\nabla_{{\bf \tilde{H}}^T}{\rm Tr} \big({\bf T}{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}{\bf \tilde{H}}^T {\bf \tilde{H}}^{\ddagger T}{\bf T}^T \big) \cr &\quad= 2({\bf I}-{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}) {\bf \tilde{H}}^T {\bf \tilde{H}}^{\ddagger T}{\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger \cr &\qquad+ \; 2({\bf I}-{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}) {\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}{\bf \tilde{H}}^\ddagger.}$$(33)

Proof. Let ${\bf P}$ denote the constant matrix that is the result of evaluating ${\bf T}{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}$ for a fixed ${\bf \tilde{H}}$ ; then, using Lemma A.1,

$$\eqalign{&\nabla_{{\bf \tilde{H}}^T}{\rm Tr} \big({\bf T}{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}{\bf \tilde{H}}^T {\bf \tilde{H}}^{\ddagger T}{\bf T}^T\big)\cr &\quad= \nabla_{{\bf \tilde{H}}^T} 2{\rm Tr} \big({\bf P}^T {\bf T}{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}\big)\cr &\quad= \nabla_{{\bf \tilde{H}}^T} 2 {\rm Tr} \big({\bf A}^{-1}{\bf \tilde{H}}{\bf P}^T {\bf T}{\bf \tilde{H}}^T \big)\cr &\quad= 2({\bf I}-{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}) {\bf \tilde{H}}^T {\bf \tilde{H}}^{\ddagger T}{\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger\cr &\qquad+ \; 2({\bf I}-{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}) {\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}{\bf \tilde{H}}^\ddagger .}$$

**Lemma A.3.**

$$\nabla_{{\bf \tilde{H}}^T}{\rm Tr} \big( {\bf T}{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}{\bf T}^T \big) = 2 ({\bf I} - {\bf \tilde{H}}^\ddagger {\bf \tilde{H}} ) {\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger.$$(34)

Proof. This follows from Lemma A.1, applied with the symmetric matrix ${\bf Z} = {\bf T}^T {\bf T}$ , and the fact that ${\rm Tr}({\bf P}{\bf Q}) = {\rm Tr}({\bf Q}{\bf P})$ :

$$\eqalign{\nabla_{{\bf \tilde{H}}^T}{\rm Tr} \big({\bf T}{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}{\bf T}^T\big) &= \nabla_{{\bf \tilde{H}}^T}{\rm Tr} \big( {\bf T}{\bf \tilde{H}}^T {\bf A}^{-1}{\bf \tilde{H}}{\bf T}^T \big) \cr &= \nabla_{{\bf \tilde{H}}^T}{\rm Tr} \big({\bf A}^{-1}{\bf \tilde{H}}{\bf T}^T {\bf T}{\bf \tilde{H}}^T \big) \cr &= 2 ({\bf I} - {\bf \tilde{H}}^\ddagger {\bf \tilde{H}} ) {\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger .}$$(35)

**Lemma A.4.**

$$\eqalign{\nabla_{{\bf \tilde{H}}^T} \alpha ({\bf \tilde{H}}) &= 2({\bf I}-{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}) {\bf \tilde{H}}^T {\bf \tilde{H}}^{\ddagger T}{\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger \cr &\quad+ \; 2({\bf I}-{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}) {\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}{\bf \tilde{H}}^\ddagger \cr &\quad- \; 4 ({\bf I} - {\bf \tilde{H}}^\ddagger {\bf \tilde{H}} ) {\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger.}$$(36)

Proof. This follows from (27) and Lemmas A.2 and A.3.

**Lemma A.5.**

$$\eqalign{\nabla_{{\bf \tilde{H}}^T}{1\over \lambda } \beta ({\bf \tilde{H}})& = \nabla_{{\bf \tilde{H}}^T}{\rm Tr} \left( {\bf U}{\bf U}^T \right) =2 {\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger {\bf A}^{-1} \cr&\quad- 2 {\bf \tilde{H}}^\ddagger \big( {\bf \tilde{H}}{\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger + {\bf \tilde{H}}^{\ddagger T}{\bf T}^T {\bf T}{\bf \tilde{H}}^T \big) {\bf A}^{-1}.}$$(37)

Proof. Let ${\bf P}$ denote a constant matrix that is the result of evaluating ${\bf U}^T {\bf A}^{-1}$ with a fixed ${\bf \tilde{H}}$ . Let ${\bf Q}$ denote a constant matrix that is the result of evaluating ${\bf \tilde{H}}{\bf T}^T$ with a fixed ${\bf \tilde{H}}$ . It follows that

$$\eqalign{\nabla_{{\bf \tilde{H}}^T}{\rm Tr} \left( {\bf U}{\bf U}^T \right) &= \nabla_{{\bf \tilde{H}}^T} 2 {\rm Tr} \left( {\bf P}{\bf \tilde{H}}{\bf T}^T \right) \cr &\quad+ \nabla_{{\bf \tilde{H}}^T} 2 {\rm Tr} \left( {\bf U}^T {\bf A}^{-1}{\bf Q} \right).}$$(38)

The first term of (38) is

$$\eqalign{\nabla_{{\bf \tilde{H}}^T} 2 {\rm Tr} \left( {\bf P}{\bf \tilde{H}}{\bf T}^T \right) &= \nabla_{{\bf \tilde{H}}^T} 2 {\rm Tr} \left( {\bf \tilde{H}}{\bf T}^T {\bf P}\right) \cr &= 2 {\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger {\bf A}^{-1}.}$$(39)

The second term of (38) is

$$\eqalign{&\nabla_{{\bf \tilde{H}}^T} 2 {\rm Tr} \left( {\bf U}^T {\bf A}^{-1}{\bf Q} \right) \cr &\quad= -2 {\bf \tilde{H}}^\ddagger ( {\bf Q}{\bf U}^T + {\bf U}{\bf Q}^T ) {\bf A}^{-1} \cr &\quad= -2 {\bf \tilde{H}}^\ddagger ( {\bf \tilde{H}}{\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger + {\bf \tilde{H}}^{\ddagger T}{\bf T}^T {\bf T}{\bf \tilde{H}}^T ) {\bf A}^{-1},}$$(40)

where the first equality comes from [ ^{29}, (114)]. Substituting (39) and (40) into (38) proves the lemma.

**Theorem A.6.**

$$\eqalign{\nabla_{{\bf \tilde{H}}^T} f &= 2({\bf I}-{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}) {\bf \tilde{H}}^\ddagger {\bf \tilde{H}}{\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger \cr &\quad+ \; 2({\bf I}-{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}) {\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger {\bf \tilde{H}}{\bf \tilde{H}}^\ddagger \cr &\quad- \; 4 ({\bf I} - {\bf \tilde{H}}^\ddagger {\bf \tilde{H}} ) {\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger \cr &\quad+ \; 2 \lambda {\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger {\bf A}^{-1} \cr &\quad- \; 2 \lambda {\bf \tilde{H}}^\ddagger \big( {\bf \tilde{H}}{\bf T}^T {\bf T}{\bf \tilde{H}}^\ddagger + {\bf \tilde{H}}^{\ddagger T}{\bf T}^T {\bf T}{\bf \tilde{H}}^T \big) {\bf A}^{-1}.}$$(41)

Proof. This follows from (26) with Lemmas A.4 and A.5.

We use ${\bf \Theta }$ to denote $\nabla_{{\bf \tilde{H}}^T} f$ throughout this paper.
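The general gradient in (41) can be sanity-checked numerically against central finite differences of the objective (24) with ${\bf U}^T$ set to its closed form (25). The sketch below is an illustration only: the matrix sizes, random data, value of $\lambda$ , and function names are all assumptions, and the code is not the parallel implementation used in the experiments.

```python
import numpy as np

def objective(H, T, lam):
    """f in (24) with U^T replaced by its closed form (25)."""
    A = H @ H.T + lam * np.eye(H.shape[0])
    U_T = T @ H.T @ np.linalg.inv(A)
    residual = U_T @ H - T
    return np.sum(residual ** 2) + lam * np.sum(U_T ** 2)

def analytic_gradient(H, T, lam):
    """Gradient (41) with respect to H^T, returned as an N x L matrix."""
    L, N = H.shape
    A_inv = np.linalg.inv(H @ H.T + lam * np.eye(L))
    H_dd = H.T @ A_inv                 # the regularized inverse, N x L
    P = H_dd @ H                       # N x N
    S = T.T @ T                        # N x N
    I_minus_P = np.eye(N) - P
    grad_alpha = (2.0 * I_minus_P @ P @ S @ H_dd
                  + 2.0 * I_minus_P @ S @ P @ H_dd
                  - 4.0 * I_minus_P @ S @ H_dd)
    grad_beta = (2.0 * S @ H_dd @ A_inv
                 - 2.0 * H_dd @ (H @ S @ H_dd + H_dd.T @ S @ H.T) @ A_inv)
    return grad_alpha + lam * grad_beta

def numeric_gradient(H, T, lam, eps=1e-6):
    """Central finite differences, stored with the same N x L layout."""
    L, N = H.shape
    grad = np.zeros((N, L))
    for l in range(L):
        for n in range(N):
            step = np.zeros_like(H)
            step[l, n] = eps
            diff = objective(H + step, T, lam) - objective(H - step, T, lam)
            grad[n, l] = diff / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
H = rng.uniform(size=(4, 12))          # made-up hidden representation, L x N
T = rng.normal(size=(3, 12))           # made-up targets, C x N
lam = 0.1
analytic = analytic_gradient(H, T, lam)
max_abs_diff = float(np.max(np.abs(analytic - numeric_gradient(H, T, lam))))
```

A small `max_abs_diff` indicates that the analytic expression and the finite-difference estimate agree for this random instance; it is a consistency check, not a proof.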

A.2.2 Simplified Gradient

When no regularization is used in the objective (i.e., $\lambda = 0$ ), the gradient can be simplified. Importantly, in this case ${\bf \tilde{H}}^\ddagger = {\bf \tilde{H}}^\dagger$ (the pseudo-inverse). Recall that ${\bf \tilde{H}}^\dagger {\bf \tilde{H}}{\bf \tilde{H}}^\dagger = {\bf \tilde{H}}^\dagger$ and ${\bf \tilde{H}}{\bf \tilde{H}}^\dagger {\bf \tilde{H}} = {\bf \tilde{H}}$ . Under these conditions, the two terms of (41) weighted by $\lambda$ drop out, the first remaining term vanishes because $({\bf I}-{\bf \tilde{H}}^\dagger {\bf \tilde{H}}) {\bf \tilde{H}}^\dagger {\bf \tilde{H}} = {\bf 0}$ , and the second reduces to $2({\bf I}-{\bf \tilde{H}}^\dagger {\bf \tilde{H}}) {\bf T}^T {\bf T}{\bf \tilde{H}}^\dagger$ :

$$\eqalign{\nabla_{{\bf \tilde{H}}^T} f &= 2({\bf I}-{\bf \tilde{H}}^\dagger {\bf \tilde{H}}) {\bf \tilde{H}}^\dagger {\bf \tilde{H}}{\bf T}^T {\bf T}{\bf \tilde{H}}^\dagger\cr &\quad+ \; 2({\bf I}-{\bf \tilde{H}}^\dagger {\bf \tilde{H}}) {\bf T}^T {\bf T}{\bf \tilde{H}}^\dagger {\bf \tilde{H}}{\bf \tilde{H}}^\dagger \cr &\quad- \; 4 ({\bf I} - {\bf \tilde{H}}^\dagger {\bf \tilde{H}} ) {\bf T}^T {\bf T}{\bf \tilde{H}}^\dagger }$$(42)

$$= 2 {\bf \tilde{H}}^\dagger {\bf \tilde{H}}{\bf T}^T {\bf T}{\bf \tilde{H}}^\dagger - 2{\bf T}^T {\bf T}{\bf \tilde{H}}^\dagger.$$(43)

Note this is the particular form of ${\bf \Theta }$ in (12) presented in Section 3 earlier.

A.3 Deriving $\nabla_{{\bf W}_{(i)}} f$

To train our model, we need to compute $\nabla_{{\bf W}_{(1)}} f$ and $\nabla_{{\bf W}_{(2)}} f$ . We employ the chain rule (e.g., as stated in [ ^{29}, (126)]):

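$$\left[ \nabla_{\bf B} f({\bf C}) \right]_{ij} = \left\langle \nabla_{\bf C} f, {\partial {\bf C}\over \partial B_{ij}} \right\rangle.$$(44)

Here ${\bf C}$ denotes a matrix that depends on the matrix ${\bf B}$ , and $f$ is a scalar function of ${\bf C}$ .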

The notation $\langle \cdot, \cdot \rangle$ denotes matrix inner product: It is an element-wise multiplication followed by a sum.

First, we find $\nabla_{\bf H_{(i)}} f$ . It follows from (44) that

$$\left[ \nabla_{{\bf H}_{(1)}} f({\bf \tilde{H}}) \right]_{in} = \left\langle {\bf \Theta }^T, {\partial {\bf \tilde{H}}\over \partial H_{(1)in}} \right\rangle.$$(45)

By the definition of the Khatri-Rao product, we get

$${\partial {\bf \tilde{H}}\over \partial H_{(1)in}} = {\bf E}_{(i,n)}^{L_1 \times N} \odot {\bf H}_{(2)},$$(46)

where, as used earlier in the paper, ${\bf E}_{(i,n)}^{L_1 \times N}$ denotes an $L_1 \times N$ matrix that is zero everywhere except for a 1 in the $(i,n)$ th position. Let ${\bf \Psi }_{(1)}$ denote the matrix $\nabla_{{\bf H}_{(1)}} f({\bf \tilde{H}})$ .

Similarly, we use ${\bf \Psi }_{(2)}$ to denote $\nabla_{{\bf H}_{(2)}} f({\bf \tilde{H}})$ , where

$$\left[ \nabla_{{\bf H}_{(2)}} f({\bf \tilde{H}}) \right]_{jn} = \big\langle {\bf \Theta }^T, {\bf H}_{(1)} \odot {\bf E}_{(j,n)}^{L_2 \times N} \big\rangle.$$(47)

Let ${\bf Z}_{(i)}$ denote ${\bf W}_{(i)}^T {\bf X}$ . By another application of the chain rule, we get

$$[\nabla_{{\bf Z}_{(i)}} f({\bf H_{(i)}}) ]_{jk} = \left\langle {\bf \Psi }_{(i)}, {\partial {\bf H}_{(i)}\over \partial Z_{(i)jk}} \right\rangle .$$(48)

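Recall that ${\bf H}_{(i)} = \sigma ( {\bf Z}_{(i)} )$ , with the sigmoid applied element-wise, so that ${\bf H}_{(i)}$ and ${\bf Z}_{(i)}$ are both $L_i \times N$ . The partial derivative is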

$${\partial {\bf H}_{(i)}\over \partial Z_{(i)jk}} = {\bf E}_{jk}^{L_i \times N} \circ {\bf H}_{(i)} \circ ({\bf 1} - {\bf H}_{(i)}),$$(49)

where ${\bf 1} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_i \times N}$ is the matrix of all ones. Since this fully decomposes over elements, we have

$$\nabla_{{\bf Z}_{(i)}} f({\bf H}_{(i)}) = {\bf H}_{(i)} \circ ({\bf 1} - {\bf H}_{(i)}) \circ {\bf \Psi }_{(i)}$$(50)

Finally, let ${\bf \Omega }_{(i)}$ denote $\nabla_{{\bf Z}_{(i)}} f({\bf H}_{(i)})$ . Then,

$$\left[ \nabla_{{\bf W}_{(i)}} f({\bf Z}_{(i)}) \right]_{jk} = \left\langle {\bf \Omega }_{(i)}, {\partial {\bf Z}_{(i)}\over \partial W_{(i)jk}} \right\rangle .$$(51)

The matrix ${\partial {\bf Z}_{(i)}\over \partial W_{(i)jk}}$ is an $L_i \times N$ matrix that is zero everywhere except for the $k$ th row, which contains the $j$ th row of ${\bf X}$ . Thus, taking the element-wise product of ${\bf \Omega }_{(i)}$ and ${\partial {\bf Z}_{(i)}\over \partial W_{(i)jk}}$ and then summing is equivalent to taking the inner product between the $j$ th row of ${\bf X}$ and the $k$ th row of ${\bf \Omega }_{(i)}$ (i.e., the $k$ th column of ${\bf \Omega }_{(i)}^T$ ). Repeating this for all $j$ and $k$ can be expressed succinctly as a matrix-matrix product:

$$\nabla_{{\bf W}_{(i)}} f = {\bf X}{\bf \Omega }_{(i)}^T = {\bf X} \left( {\bf H}_{(i)}^T \circ ({\bf 1} - {\bf H}_{(i)})^T \circ {\bf \Psi }_{(i)}^T \right).$$(52)

This gives us an expression for the gradients $\nabla_{{\bf W}_{(i)}} f$ , which we can use to train a T-DSN block.
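As a sanity check on (50) and (52), the following NumPy sketch compares the analytic gradient with central finite differences. It is a toy setting, not the T-DSN objective: the loss $f = {1\over 2}\Vert {\bf H} - {\bf C} \Vert_F^2$ (so that ${\bf \Psi } = {\bf H} - {\bf C}$ ), the matrix `c`, and the function names are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def weight_gradient(x, w, psi_fn):
    """Gradient of f with respect to w (D x L), following (50) and (52).

    psi_fn maps H (L x N) to Psi = grad_H f (L x N).
    """
    h = sigmoid(w.T @ x)               # H = sigma(W^T X), shape L x N
    omega = h * (1.0 - h) * psi_fn(h)  # (50): element-wise product
    return x @ omega.T                 # (52): shape D x L


def numerical_gradient(f, w, eps=1e-6):
    """Central finite-difference approximation of grad_w f."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        step = np.zeros_like(w)
        step[idx] = eps
        grad[idx] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad


# Toy problem: D = 6 inputs, L = 3 hidden units, N = 8 samples.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))
w = 0.1 * rng.standard_normal((6, 3))
c = rng.random((3, 8))  # hypothetical targets for the toy loss


def loss(w_):
    return 0.5 * np.sum((sigmoid(w_.T @ x) - c) ** 2)


analytic = weight_gradient(x, w, lambda h: h - c)
numeric = numerical_gradient(loss, w)
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
```

In a T-DSN block, `psi_fn` would instead return ${\bf \Psi }_{(i)}$ as obtained from ${\bf \Theta }$ and the other hidden layer; the remaining steps are unchanged.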

#### B Equivalence between the T-DSN and DSN Gradients

Under the conditions described in Section 3.1, namely, that ${\bf W}_{(1)}$ has dimension $D \times 1$ and is entry-wise zero, we show that the T-DSN gradient $\nabla_{{\bf W}_{(2)}} f$ and the DSN gradient $\nabla_{\bf W} f$ are equivalent. Let ${\bf \hat{\Theta }} = \nabla_{\bf \hat{H}} f$ be defined analogously to (12). Using the simplified gradient, it follows that

$$\eqalign{{\bf \hat{\Theta }} &= 2 \big({\bf \hat{H}^\dagger } ({\bf \hat{H}T}^T)({\bf T\hat{H}^\dagger }) - {\bf T}^T({\bf T\hat{H}^\dagger }) \big) \cr &= 4 \big({\bf H^\dagger } ({\bf HT}^T)({\bf TH^\dagger }) - {\bf T}^T({\bf TH^\dagger }) \big) = 2{\bf \Theta }.}$$(53)

Then, because each entry of ${\bf \hat{H}}_{(1)}$ is $\sigma (0) = 1/2$ ,

$${\bf \hat{\Psi }}_{(2)in} = \langle {\bf \hat{\Theta }}, {\bf \hat{H}}_{(1)} \odot {\bf E}_{(i,n)}^{L_2 \times N} \rangle \Rightarrow {\bf \hat{\Psi }}_{(2)} = {1\over 2}{\bf \hat{\Theta }} = {\bf \Theta }.$$(54)

This gives the following gradient:

$$\nabla_{{\bf W}_{(2)}} f = {\bf X}\big({\bf \hat{H}}^T_{(2)} \circ ({\bf 1} - {\bf \hat{H}}^T_{(2)}) \circ {\bf \Theta }\big) = \nabla_{{\bf W}} f.$$(55)

The gradients are identical under these conditions. For them to remain identical, ${\bf W}_{(1)}$ must be clamped to zero.
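The scaling argument in (53) and (54) can be verified numerically. The sketch below is an illustration under stated assumptions: `theta` evaluates the simplified form (43) for a generic hidden matrix, the dimensions are arbitrary, and the Khatri-Rao product with a $1 \times N$ row of $1/2$ entries is written directly as a scaling by $1/2$ .

```python
import numpy as np


def theta(h, t):
    """Evaluate 2 H^+ (H T^T)(T H^+) - 2 T^T (T H^+) for h (L x N), t (C x N)."""
    h_pinv = np.linalg.pinv(h)   # N x L
    tth = t.T @ (t @ h_pinv)     # T^T (T H^+), N x L
    return 2.0 * h_pinv @ (h @ tth) - 2.0 * tth


# Arbitrary sizes: L = 4 hidden units, C = 3 classes, N = 10 samples.
rng = np.random.default_rng(0)
n = 10
h = rng.random((4, n))
t = rng.standard_normal((3, n))

# With W_(1) = 0 and L_1 = 1, H_(1) is a 1 x N row of sigma(0) = 1/2,
# so the Khatri-Rao product with H_(2) = h is simply h scaled by 1/2.
h1_hat = np.full((1, n), 0.5)
h_hat = h1_hat * h

theta_dsn = theta(h, t)
theta_hat = theta(h_hat, t)   # expected to equal 2 * theta_dsn, as in (53)
psi2_hat = 0.5 * theta_hat    # (54): contraction with H_(1) = 1/2
```

The factor of 2 arises because halving ${\bf H}$ doubles its pseudoinverse, and the two scalings cancel in ${\bf \hat{H}}^\dagger {\bf \hat{H}}$ but not in the trailing ${\bf \hat{H}}^\dagger$ .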

#### C Complementarity of the Hidden Views

In the T-DSN, an input vector ${\bf x} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^D$ is mapped simultaneously to two different hidden representations, ${\bf h}_1 \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_1}$ and ${\bf h}_2 \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_2}$ , via

$${\bf h}_1 = \sigma \big({\bf W}_1^T {\bf x}\big), \qquad {\bf h}_2 = \sigma \big({\bf W}_2^T {\bf x}\big).$$

In all cases considered in this paper, $L_1 = L_2 \ll D$ , so the two linear maps ${\bf W}_i^T \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{L_i \times D}$ have full row rank but a $(D-L_i)$ -dimensional null space. The $L_i$ -dimensional row spaces of ${\bf W}_i^T$ (subspaces of ${\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^D$ ), which we shall denote ${\cal R}_i$ , can be used to characterize the linear maps: each identifies which dimensions of the input space are preserved (and, implicitly, which are destroyed). To assess the degree to which these two hidden representations capture different views of the data, we took a pair of matrices ${\bf W}_1$ and ${\bf W}_2$ trained on the TIMIT task (Section 5.2), where $L_1 = L_2 = 90$ , and used two notions of subspace similarity to infer whether the T-DSN does indeed push toward extracting different views of the data.

C.1 Measuring “Mass” Lost

C.1.1 Method

To measure the similarity between the two row spaces, we first consider the following score:

$${\rm sim}({\bf W}_1,{\bf W}_2) = \sum_i \sqrt{ \sum_j \big({\bf B}_{1i}^T {\bf B}_{2j} \big)^2},$$

where ${\bf B}_{1i} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^D$ denotes the $i$ th basis vector in some basis of ${\cal R}_1$ , and ${\bf B}_{2j} \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^D$ likewise denotes the $j$ th basis vector in some basis of ${\cal R}_2$ . Assuming ${\bf W}_i$ has singular value decomposition ${\bf U}_i {\bf \Sigma }_i {\bf V}_i^T$ , the process of computing ${\rm sim}$ is, in detail, as follows:

1. Take the $i$ th basis vector ${\bf B}_{1i}$ and translate it into ${\cal R}_2$ via ${\bf U}_2$ , producing ${\bf U}_2^T {\bf B}_{1i}$ .
2. Take the $\ell_2$ norm of ${\bf U}_2^T {\bf B}_{1i}$ . If $\Vert {\bf U}_2^T {\bf B}_{1i} \Vert_2$ equals 1, then no mass has been lost during the change of basis; i.e., ${\bf B}_{1i}$ is in the row space ${\cal R}_2$ . If it is equal to 0, then all mass has been lost; i.e., ${\bf B}_{1i}$ is in the null space of ${\bf W}_2^T$ .
3. ${\rm sim}$ is the sum over $i$ of the scores computed in step 2, aggregating the amount of “mass” that was preserved or lost in the change of basis, and thus characterizing the similarity between the two spaces.

The maximum value ${\rm sim}$ can attain is the subspace dimension $L_1 = L_2$ (90 here), reached when ${\cal R}_1 = {\cal R}_2$ ; the minimum value it can attain is 0, reached when the two spaces are orthogonal.
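The score can be sketched in a few lines of NumPy. This is an illustration rather than the code used for the experiments: the function names are hypothetical, each ${\bf W}_i$ is assumed to be a full-rank $D \times L_i$ matrix, and the orthonormal bases are taken from the thin SVD, as in the description above.

```python
import numpy as np


def orthonormal_basis(w):
    """Orthonormal basis (as columns) for the column space of w (D x L, full rank)."""
    u, _, _ = np.linalg.svd(w, full_matrices=False)
    return u


def sim(w1, w2):
    """Sum over basis vectors of R_1 of the norm that survives projection onto R_2."""
    b1 = orthonormal_basis(w1)
    b2 = orthonormal_basis(w2)
    # Entry (i, j) of b1.T @ b2 is B_1i^T B_2j, so each row norm is the
    # length of B_1i expressed in the basis of R_2.
    return float(np.sum(np.linalg.norm(b1.T @ b2, axis=1)))


# Identical subspaces reach the maximum (the subspace dimension, 4 here).
rng = np.random.default_rng(0)
w = rng.standard_normal((20, 4))
sim_same = sim(w, w)

# Subspaces spanned by disjoint coordinate axes are orthogonal and score 0.
eye = np.eye(20)
sim_orthogonal = sim(eye[:, :4], eye[:, 4:8])
```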

C.1.2 Results

For our pair of learned matrices, ${\bf W}_1, {\bf W}_2 \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{785 \times 90}$ , we measured ${\rm sim}$ and obtained a value of 30.41 (out of 90 possible). We also randomly generated 10,000 pairs of orthonormal bases for 90D subspaces of ${\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{785}$ and computed the corresponding ${\rm sim}$ values. On average, ${\rm sim}$ was 31.09, with a standard deviation of 0.21, putting ${\rm sim}({\bf W}_1,{\bf W}_2)$ over three standard deviations below the mean. Thus, while the two mappings do not capture mutually exclusive information from the input, they extract more dissimilar information than would be expected by random chance, suggesting that they do in fact push toward extracting complementary views of the data.
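A random baseline of this kind could be generated as in the sketch below. The sampling scheme (QR factorization of a Gaussian matrix), the function names, and the small problem size are assumptions made for illustration; the text above does not specify how the random bases were drawn.

```python
import numpy as np


def random_orthonormal_basis(d, l, rng):
    """Orthonormal basis (d x l) of a random l-dimensional subspace of R^d."""
    q, _ = np.linalg.qr(rng.standard_normal((d, l)))
    return q


def mass_score(b1, b2):
    """The sim score for two orthonormal bases given as columns."""
    return float(np.sum(np.linalg.norm(b1.T @ b2, axis=1)))


def baseline(score_fn, d, l, trials, seed=0):
    """Mean and sample standard deviation of score_fn over random subspace pairs."""
    rng = np.random.default_rng(seed)
    scores = np.array([
        score_fn(random_orthonormal_basis(d, l, rng),
                 random_orthonormal_basis(d, l, rng))
        for _ in range(trials)
    ])
    return float(scores.mean()), float(scores.std(ddof=1))


def z_score(observed, mean, std):
    """Number of standard deviations by which observed differs from the mean."""
    return (observed - mean) / std


# Small illustrative run (the paper's setting is d = 785, l = 90, 10,000 trials).
demo_mean, demo_std = baseline(mass_score, d=50, l=5, trials=200)

# Applying z_score to the values reported in the text.
reported_z = z_score(30.41, 31.09, 0.21)
```

With the reported mean and standard deviation, the observed score lies about 3.2 standard deviations below the random baseline.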

C.2 Measuring Canonical Angles

We can also use canonical angles to assess subspace similarity.

C.2.1 Method

The cosine of the $k$ th canonical angle can be defined recursively as the largest inner product between a basis vector of ${\cal R}_1$ and a basis vector of ${\cal R}_2$ , excluding all basis vectors that have already been “used” in previous steps (i.e., that were part of a larger inner product). The cosines of the canonical angles $\theta_1, \theta_2, \ldots, \theta_L$ between two $L$ -dimensional subspaces (here $L = L_1 = L_2$ ) can be computed as the singular values of ${\bf U}_1^T {\bf U}_2$ , where the columns of ${\bf U}_i$ form an orthonormal basis of ${\cal R}_i$ . One way to measure similarity is to sum the cosines of the canonical angles; this notion of similarity also has a maximum value of $L$ (if the spaces are equal) and a minimum value of 0 (if the spaces are orthogonal).
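A minimal NumPy sketch of this computation follows. The function names are hypothetical, and each input is assumed to be a full-rank matrix whose columns span the subspace of interest.

```python
import numpy as np


def canonical_cosines(w1, w2):
    """Cosines of the canonical angles between the column spaces of w1 and w2."""
    u1, _, _ = np.linalg.svd(w1, full_matrices=False)
    u2, _, _ = np.linalg.svd(w2, full_matrices=False)
    cosines = np.linalg.svd(u1.T @ u2, compute_uv=False)
    # Clip to guard against round-off pushing values slightly outside [0, 1].
    return np.clip(cosines, 0.0, 1.0)


def cosine_sum(w1, w2):
    """Similarity score: the sum of the canonical-angle cosines."""
    return float(np.sum(canonical_cosines(w1, w2)))


# Two lines in the plane at 45 degrees: a single canonical angle.
line_a = np.array([[1.0], [0.0]])
line_b = np.array([[1.0], [1.0]])
cos_45 = cosine_sum(line_a, line_b)

# Equal subspaces give the maximum; orthogonal subspaces give 0.
rng = np.random.default_rng(0)
w = rng.standard_normal((20, 4))
eye = np.eye(20)
sum_same = cosine_sum(w, w)
sum_orthogonal = cosine_sum(eye[:, :4], eye[:, 4:8])
```

Unlike the score of Section C.1, the singular values of ${\bf U}_1^T {\bf U}_2$ do not depend on which orthonormal bases are chosen for the two subspaces.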

C.2.2 Results

For the same pair of estimated matrices, ${\bf W}_1,{\bf W}_2 \in {\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{785 \times 90}$ , we measured the sum of the cosines of the canonical angles and obtained a value of 26.20 (out of 90 possible). We also randomly generated 10,000 pairs of orthonormal bases for 90D subspaces of ${\hbox{\rlap{I}\kern 2.0pt{\hbox{R}}}}^{785}$ and computed the corresponding sums. On average, the sum was 26.71, with a standard deviation of 0.20, putting the value for $({\bf W}_1,{\bf W}_2)$ roughly 2.5 standard deviations below the mean. This also suggests that the two subspaces are being pushed to capture different information, but again, they are far from orthogonal.

C.3 Discussion

This analysis provides evidence in support of the notion that the T-DSN learns “different” views of the data. These different views capture a more diverse set of dimensions in the input than would occur by random chance, but are far from orthogonal.

The authors would like to thank Karthik Mohan for his valuable feedback on the analysis of the T-DSN model.

- 1. L. Deng, and D. Yu, “Deep Convex Networks: A Scalable Architecture for Speech Pattern Classification,” *Proc. Ann. Conf. Int'l Speech Comm. Assoc.,* Aug. 2011.
- 2. L. Deng, and D. Yu, “Deep Convex Networks for Image and Speech Classification,” *Proc. ICML Workshop Learning Architectures,* June 2011.
- 3. L. Deng, D. Yu, and J. Platt, “Scalable Stacking and Learning for Building Deep Architectures,” *Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing,* 2012.
- 4. D.H. Wolpert, “Stacked Generalization,” *Neural Networks,* vol. 5, no. 2, pp. 241-259, 1992.
- 5. G. Dahl, M. Ranzato, A. Mohamed, and G. Hinton, “Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine,” *Proc. Advances in Neural Information Processing Systems Conf.,* Dec. 2010.
- 6. M. Ranzato, A. Krizhevsky, and G. Hinton, “Factored 3-Way Restricted Boltzmann Machines for Modeling Natural Images,” *Proc. Int'l Conf. Artificial Intelligence and Statistics,* vol. 13, 2010.
- 7. M. Ranzato, and G. Hinton, “Modeling Pixel Means and Covariances Using Factorized Third-Order Boltzmann Machines,” *Proc. IEEE Conf. Computer Vision and Pattern Recognition,* pp. 2551-2558, 2010.
- 8. B. Hutchinson, L. Deng, and D. Yu, “A Deep Architecture with Bilinear Modeling of Hidden Representations: Applications to Phonetic Recognition,” *Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing,* 2012.
- 9. G. Hinton, and R. Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks,” *Science,* vol. 313, no. 5768, pp. 504-507, 2006.
- 10. G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-Dependent Pre-Trained Deep Neural Networks for Large Vocabulary Speech Recognition,” *IEEE Trans. Audio, Speech, and Language Processing,* vol. 20, no. 1, pp. 30-42, Jan. 2012.
- 11. A. Mohamed, G. Dahl, and G. Hinton, “Acoustic Modeling Using Deep Belief Networks,” *IEEE Trans. Audio, Speech, and Language Processing,* vol. 20, no. 1, pp. 14-22, Jan. 2012.
- 12. Q. Le, M. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Ng, “Building High-Level Features Using Large Scale Unsupervised Learning,” *Proc. Int'l Conf. Machine Learning,* 2012.
- 13. Y. Bengio, “Learning Deep Architectures for AI,” *Foundations and Trends in Machine Learning,* vol. 2, no. 1, pp. 1-127, 2009.
- 14. D. Yu, and L. Deng, “Accelerated Parallelizable Neural Network Learning Algorithm for Speech Recognition,” *Proc. 12th Ann. Conf. Int'l Speech Comm. Assoc.,* Aug. 2011.
- 15. E. Weisstein, “Symmetric Bilinear Form,” *MathWorld and Wikipedia,* 2012.
- 16. T.G. Kolda, and B.W. Bader, “Tensor Decompositions and Applications,” *SIAM Rev.,* vol. 51, no. 3, pp. 455-500, Sept. 2009.
- 17. D.M. Dunlavy, T.G. Kolda, and E. Acar, “Poblano v1.0: A Matlab Toolbox for Gradient-Based Optimization,” Technical Report SAND2010-1422, Sandia Nat'l Laboratories, Albuquerque, N.M., and Livermore, Calif., Mar. 2010.
- 18. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” *Proc. IEEE,* vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
- 19. D. Ciresan, U. Meier, J. Masci, L. Gambardella, and J. Schmidhuber, “Flexible, High Performance Convolutional Neural Networks for Image Classification,” *Proc. 22nd Int'l Joint Conf. Artificial Intelligence,* 2011.
- 20. A. Mohamed, D. Yu, and L. Deng, “Investigation of Full-Sequence Training of Deep Belief Networks for Speech Recognition,” *Proc. Ann. Conf. Int'l Speech Comm. Assoc.,* Sept. 2010.
- 21. A. Mohamed, G. Dahl, and G. Hinton, “Deep Belief Networks for Phone Recognition,” *Proc. NIPS Workshop Deep Learning for Speech Recognition and Related Applications,* Dec. 2009.
- 22. K.F. Lee, and H.W. Hon, “Speaker-Independent Phone Recognition Using Hidden Markov Models,” *IEEE Trans. Audio, Speech, and Language Processing,* vol. 37, no. 11, pp. 1641-1648, Nov. 1989.
- 23. G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition,” *IEEE Signal Processing Magazine,* vol. 29, no. 6, pp. 82-97, Nov. 2012.
- 24. L. Deng, D. Yu, and A. Acero, “Structured Speech Modeling,” *IEEE Trans. Audio, Speech, and Language Processing,* vol. 14, no. 5, pp. 1492-1504, Sept. 2006.
- 25. D.B. Paul, and J.M. Baker, “The Design for the Wall Street Journal-Based CSR Corpus,” *Proc. Int'l Conf. Spoken Language Processing,* 1992.
- 26. D. Yu, L. Deng, and F. Seide, “Large Vocabulary Speech Recognition Using Deep Tensor Neural Networks,” *Proc. Ann. Conf. Int'l Speech Comm. Assoc.,* 2012.
- 27. G. Tur, L. Deng, D. Hakkani-Tur, and X. He, “Toward Deeper Understanding: Deep Convex Networks for Semantic Utterance Classification,” *Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing,* 2012.
- 28. D. Yu, S. Siniscalchi, L. Deng, and C. Lee, “Boosting Attribute and Phone Estimation Accuracies with Deep Neural Networks for Detection-Based Speech Recognition,” *Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing,* 2012.
- 29. K.B. Petersen, and M.S. Pedersen, “The Matrix Cookbook,” http://matrixbook.com, 2008.

Brian Hutchinson received the BA degree in linguistics and the BS and MS degrees in computer science, all from Western Washington University, and the master's degree in electrical engineering from the University of Washington, where he is currently working toward the PhD degree. He worked at Microsoft Research, Redmond, as a PhD intern in 2010 and 2011. His research interests include speech and language processing, optimization, pattern classification, and machine learning. In particular, he is interested in matrix and tensor rank minimization in the context of language processing. He is a student member of the IEEE.

Li Deng received the PhD degree from the University of Wisconsin Madison. He joined the Department of Electrical and Computer Engineering, University of Waterloo, Ontario, Canada, in 1989, as an assistant professor, where he became a full professor with tenure in 1996. In 1999, he joined Microsoft Research (MSR), Redmond, Washington, as a senior researcher, where he is currently a principal researcher. Since 2000, he has also been an affiliate full professor and graduate committee member in the Department of Electrical Engineering at the University of Washington, Seattle. Prior to MSR, he also worked or taught at the Massachusetts Institute of Technology, ATR Interpreting Telecommunications Research Laboratories, Kyoto, Japan, and the Hong Kong University of Science and Technology. In broad areas of speech/language processing, signal processing, and machine learning, he has published more than 300 refereed papers in leading journals and conferences and three books. His recent technical work (since 2009) on industry-scale deep learning with colleagues and collaborators has created significant impact on speech recognition, signal processing, and related applications with high practical value. He has been granted more than 60 US or international patents. He served on the Board of Governors of the IEEE Signal Processing Society (2008-2010). More recently, he served as an editor-in-chief for the *IEEE Signal Processing Magazine* (2009-2011), which ranked first in both years among all publications within the Electrical and Electronics Engineering Category worldwide in terms of its impact factor and for which he received the 2011 IEEE SPS Meritorious Service Award. He currently serves as editor-in-chief of the *IEEE Transactions on Audio, Speech and Language Processing*. He is a fellow of the IEEE, the Acoustical Society of America, and ISCA.

Dong Yu received the BS degree (with honors) in electrical engineering from Zhejiang University, China, the MS degree in computer science from Indiana University at Bloomington, the MS degree in electrical engineering from the Chinese Academy of Sciences, and the PhD degree in computer science from the University of Idaho. He joined Microsoft Corporation in 1998 and Microsoft Speech Research Group in 2002, where he currently is a senior researcher. His current research interests include speech processing, robust speech recognition, discriminative training, spoken dialog system, machine learning, and pattern recognition. He has published more than 100 papers in these areas and is the inventor/co-inventor of close to 50 granted/pending patents. He is currently serving as an associate editor of the *IEEE Transactions on Audio, Speech, and Language Processing* (2011-) and has served as an associate editor of the *IEEE Signal Processing Magazine* (2008-2011), and as the lead guest editor of the *IEEE Transactions on Audio, Speech, and Language Processing* special issue on deep learning for speech and language processing (2010-2011). He is a senior member of the IEEE.
