^{1}], [ ^{2}]. However, for many text mining applications, rich features give rise to extremely high-dimensional feature spaces that may render the learning algorithms intractable. Therefore, the removal of redundant information from large-scale features becomes a necessary preprocessing step in many NLP and text mining tasks. Relevant reduction techniques commonly used in NLP include feature selection using wrapper or filter models [ ^{3}], [ ^{4}], feature clustering [ ^{4}], [ ^{5}], or latent variable models [ ^{6}], [ ^{7}].

^{8}], [ ^{9}], [ ^{10}], [ ^{11}], [ ^{12}], [ ^{13}], [ ^{14}], [ ^{15}], [ ^{16}], [ ^{17}], [ ^{18}], [ ^{19}]. Algorithms of this type generate lower dimensional embeddings that preserve certain properties and characteristics of the original high-dimensional feature space, such as pairwise proximity information based on local neighborhood graphs. Typically, they operate in an unsupervised manner, as they do not utilize any label or output information. Although such unsupervised dimensionality reduction (UDR) provides a compact representation of the data when used as a preprocessing step for a supervised classification or regression task, it does not always improve the final performance. This is because real-world data sets are often noisy and may exhibit incompatibilities between the input features and the output labels (e.g., samples closely/distantly located in the feature space belong to different/same classes). Because such methods rely on local similarity statistics, these problems are often exacerbated.

^{19}], [ ^{20}], [ ^{21}], [ ^{22}], [ ^{23}], [ ^{24}], [ ^{25}], or modify various UDR methods by incorporating label information into the proximities governing the embeddings [ ^{26}], [ ^{27}], [ ^{28}], [ ^{29}], [ ^{30}]. For SSDR, many existing methods combine the proximity information from the SDR and UDR algorithms [ ^{29}], [ ^{31}], [ ^{32}]. A broader review of semi-supervised methods can be found in [ ^{33}].

^{34}], [ ^{35}], multilabel SDR/SSDR have attracted increasing interest, with various algorithms proposed. These are based on combining the feature and label information [ ^{36}], [ ^{37}], maximizing label correlations [ ^{38}], [ ^{39}], using hypergraph modeling [ ^{40}], or optimizing statistical criteria between embeddings and label vectors [ ^{41}], [ ^{42}], [ ^{43}], [ ^{44}], [ ^{45}], [ ^{46}], [ ^{47}].

^{48}], [ ^{49}] and maximum entropy unfolding [ ^{50}], which explicitly model the data distributions and learn the covariance matrix, have also been proposed. The advantages of generative methods are that they can recover the manifold and generalize from fewer and noisier training samples. They map the latent space to the high-dimensional data space using smoothness constraints to preserve global or local proximities. Spectral methods, on the other hand, support more explicit and flexible modeling of the mapping of the data to the embedding space; they are more direct to optimize and more readily incorporate data constraints. A recent analysis of these methods can be found in [ ^{51}].

^{52}]:

(1)

^{6}]. The method defines an orthogonal projection matrix to enable optimal reconstruction in terms of an error defined via the Frobenius norm:

(2)

(3)

(4)

^{14}] and is solved with eigen-decomposition of . Alternatively, the normalized spectral clustering (NSC) [ ^{12}] and Laplacian eigenmaps (LE) [ ^{16}] define the constraints . This is also equivalent to using the Laplacian and is solved as a generalized eigenvalue problem. Another version is the symmetric Laplacian and assumes orthogonality on [ ^{13}], [ ^{14}].

^{53}]:

(5)

^{15}].

^{10}]. Subsequently, all embeddings are recovered to comply with the reconstruction weights by solving the optimization problem:

(6)

^{17}] and orthogonal LPP (OLPP) [ ^{11}] are such linear variations of LE, while neighborhood preserving projections (NPP) and orthogonal NPP (ONPP) [ ^{11}] are linear variations of LLE. OLPP and ONPP directly impose orthogonality on the projection matrix, while NPP requires and LPP either the last constraint or .

**2.1.1 Feature-Based Weight Matrix** In general, the weight matrix used by the above algorithms is controlled by an adjacency graph. Each weight is nonzero only for adjacent nodes in the graph. There are two principal ways to define this adjacency:

There are various alternatives to defining :

Constant values if the th and th samples are adjacent, and otherwise.

The Gaussian kernel, giving

(7)

Local-scaling Gaussian kernel [ ^{54}], with

(8)

and denotes the th nearest neighbor of .

An alternative weight definition [ ^{27}], based on

(9)

The optimal weight reconstruction matrix of LLE [ ^{10}], as mentioned earlier.

Any user-defined domain-dependent set of similarities between the samples [ ^{55}].

Any of the classical cosine norm, Pearson's, Spearman's or Kendall's correlation coefficients, etc.

Weights depending on distances based on geodesic distances as in Isomap [ ^{15}] and related updating schemes [ ^{56}].
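The adjacency graph plus weight construction above can be sketched in code. The following is our own minimal illustration (the function name, parameters, and the or-style symmetrization of the k-NN graph are assumptions, not the paper's notation), combining a k-NN adjacency with Gaussian-kernel edge weights, one of the configurations listed:

```python
import numpy as np

def feature_weights(X, k=5, sigma=1.0):
    """Gaussian-kernel weights on a symmetrized k-NN adjacency graph.

    X : (n, d) array of samples. Returns an (n, n) symmetric weight
    matrix W whose entries are nonzero only for adjacent nodes.
    Names and defaults are illustrative."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D2, np.inf)            # exclude self-loops
    # k-NN adjacency, symmetrized: i ~ j if either is a neighbor of the other
    idx = np.argsort(D2, axis=1)[:, :k]
    A = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    A[rows, idx.ravel()] = True
    A = A | A.T
    # Gaussian kernel applied to adjacent pairs only
    W = np.where(A, np.exp(-D2 / (2.0 * sigma**2)), 0.0)
    return W
```

Swapping the Gaussian kernel for constant values, a correlation coefficient, or any of the other listed similarities only changes the last step.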

**2.2.1 Single-Label Classification** Fisher discriminant analysis (FDA) [ ^{20}] is the most commonly used linear SDR method for single-label classification, employing the projection . It computes the optimal projection matrix as

(10)

where and are the between-class and within-class scatter matrices. FDA motivates a group of related SDR techniques that pursue embeddings or projections by setting up the between- and within-class matrices in different ways. For instance, by incorporating the local structure of the data into the Fisher criterion, local FDA (LFDA) combines the ideas of LPP and FDA [ ^{24}]. LFDA solves the same optimization problem as in (10), but with and redefined as in [ ^{24}] using local information. Marginal Fisher analysis (MFA) is another variant of FDA that also considers the local data structure [ ^{23}], and solves (10), with and based on intraclass -NNs and interclass -nearest pairs ( -NPs).
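The FDA computation just described can be sketched as follows. This is our own minimal implementation of the classical between-/within-class scatter construction and the associated generalized eigenproblem; the function name and the small ridge `reg` added to the within-class scatter for numerical stability are assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def fda_projection(X, y, dim, reg=1e-6):
    """Fisher discriminant analysis sketch: build the between-class
    scatter S_b and within-class scatter S_w, then solve the
    generalized eigenproblem S_b p = lambda S_w p for the top
    `dim` eigenvectors."""
    n, d = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                 # within-class scatter
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)    # between-class scatter
    vals, vecs = eigh(Sb, Sw + reg * np.eye(d))       # generalized eigenproblem
    return vecs[:, np.argsort(vals)[::-1][:dim]]      # top eigenvectors
```

The local variants (LFDA, MFA) keep this overall structure and only redefine the two scatter matrices from neighborhood information.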

With the same goal as FDA, the maximum margin criterion (MMC) is proposed in the form of , instead of that used by the Fisher criterion. The MMC-based supervised embeddings can be obtained by solving the following optimization problem [ ^{21}]:

(11)

A regularized version of MMC in the form of , is sometimes preferred for implementation with [ ^{31}]. Discriminative locality alignment (DLA) is a recently proposed SDR method, defined by Zhang et al. [ ^{19}]:

(12)

where the data points denote the intraclass -NNs of , and denote the interclass -NNs of . By viewing the first term of (12) as a modified within-class scatter constructed from intraclass neighbors and the second term as a modified between-class scatter constructed from interclass neighbors, DLA minimizes a criterion in the form of . This can be viewed as a modified MMC using the local structure of data.

Another group of SDR techniques has been developed by incorporating the label information into OLPP. Discriminant neighborhood embedding (DNE) replaces the weight matrix of OLPP with the following [ ^{26}]:

(13)

This also minimizes the distances between intraclass neighbors and maximizes the distances between interclass neighbors, as DLA does. Repulsion OLPP (OLPP-R) is an SDR method that replaces the Laplacian in OLPP with the linear combination of a repulsion and a class Laplacian [ ^{27}]. The two weight matrices and are set to have nonzero elements corresponding to pairs of points from the same class and to nearby pairs from different classes, respectively. The final Laplacian of OLPP-R is , where is a user-defined parameter.
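A signed weight matrix in the spirit of the DNE construction can be sketched as follows. This is our own illustration, not the paper's eq. (13): we assume +1 weights for same-class k-NN pairs and -1 for different-class k-NN pairs, symmetrized:

```python
import numpy as np

def dne_weights(X, y, k=3):
    """Signed neighborhood weights, DNE-style (illustrative): +1 for
    k-NN pairs from the same class, -1 for k-NN pairs from different
    classes, 0 for non-neighbors."""
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D2, np.inf)          # no self-loops
    idx = np.argsort(D2, axis=1)[:, :k]   # k nearest neighbors per sample
    W = np.zeros((n, n))
    for i in range(n):
        for j in idx[i]:
            s = 1.0 if y[i] == y[j] else -1.0
            W[i, j] = W[j, i] = s         # symmetrize
    return W
```

Embedding then proceeds as in OLPP but with this signed matrix, so that intraclass neighbors are pulled together while interclass neighbors are pushed apart.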

The unsupervised ONPP method has also been adapted for SDR. For example, the supervised ONPP (SONPP) [ ^{11}] uses LLE-style reconstruction weights calculated for all samples from the same class. Discriminative ONPP (DONPP) [ ^{29}] uses slightly different reconstruction weights from SONPP, calculated by assuming each sample is reconstructed from the remaining samples from its class, to minimize the embedding distances, and, at the same time, it maximizes the distances of the embeddings corresponding to nearby samples from different classes. The repulsion ONPP (ONPP-R) [ ^{27}] uses a similar weight matrix to SONPP, but incorporates a repulsion Laplacian as OLPP-R.

**2.2.2 Multilabel Classification** Given a multilabel classification data set, most SDR algorithms model the class (target) information as an output matrix [ ^{37}], [ ^{42}], [ ^{43}], [ ^{44}], of which each element if the th sample belongs to the th class , and (or sometimes) otherwise.

Classical SDR algorithms for multilabel classification compute embeddings by optimizing a statistical criterion between the embeddings and multiple labels. Canonical correlation analysis (CCA), for example, maximizes the correlation coefficient between the projected features and the labels [ ^{41}], [ ^{43}] as

(14)

where and are the centered feature and label matrices. The regularized CCA (rCCA) modifies the constraint of the original CCA to to prevent overfitting and avoid the singularity of , where is the user-defined regularization parameter [ ^{43}], [ ^{57}]. Partial least squares (PLS) maximizes the covariance between the projected features and the labels [ ^{42}], [ ^{58}]:

(15)

The orthonormalized PLS (OPLS) changes the constraint of (15) to [ ^{42}], [ ^{59}]. Both the multilabel dimensionality reduction via dependence maximization (MDDM) [ ^{44}] and supervised feature extraction using Hilbert-Schmidt norm (SFEHS) [ ^{46}] compute the optimal embeddings by maximizing the Hilbert-Schmidt independence criterion (HSIC) between the projected features and the labels. Using the HSIC defined in [ ^{60}], MDDM and SFEHS solve the following optimization problem:

(16)

where can be either or the kernel matrix computed from the label vectors . Using the HSIC defined in [ ^{61}], SFEHS solves a similar optimization problem to (16), but with replaced by a different symmetric matrix (see [ ^{46}]).
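The HSIC-style trace maximization used by MDDM can be sketched as follows. This is our own illustration under specific choices (a linear label kernel and an orthonormal projection constraint); the function and variable names are ours:

```python
import numpy as np

def mddm_embedding(X, Y, dim):
    """MDDM-style HSIC maximization sketch: maximize
    tr(P' X' H K_y H X P) subject to P'P = I, where H is the centering
    matrix and K_y = Y Y' is a linear kernel on the label vectors.
    The optimum is given by the top eigenvectors of X' H K_y H X."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Ky = Y @ Y.T                          # linear kernel on label vectors
    M = X.T @ H @ Ky @ H @ X
    M = (M + M.T) / 2.0                   # symmetrize for numerical safety
    vals, vecs = np.linalg.eigh(M)
    P = vecs[:, ::-1][:, :dim]            # top eigenvectors
    return P, X @ P                       # projection matrix and embeddings
```

Replacing `Ky` with a different symmetric matrix built from the label vectors yields the SFEHS-style variant mentioned above.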

The hypergraph spectral learning (HSL) method [ ^{40}] achieves multilabel SDR by modifying LPP, replacing the Laplacian matrix of LPP with a hypergraph-based Laplacian matrix , where is an similarity matrix computed from a hypergraph modeling the multilabel class information of the training data set. Three different ways are provided in [ ^{40}] to compute this similarity matrix, based on clique expansion, star expansion, and the hypergraph Laplacian [ ^{62}], corresponding to HSL1, HSL2, and HSL3, respectively.

The supervised latent variable model (SLVM) is another SDR algorithm for multilabel classification, aiming at the optimal reconstruction of both the feature and label matrices in the embedded space. It is based on the optimization [ ^{36}], [ ^{37}]

(17)

where is a degree parameter determining how much the embeddings should be biased by the label information. By centering both feature and label matrices, applying the projection technique, and introducing the Tikhonov regularization, the multi-output regularized feature projection (MORP) is also proposed in [ ^{37}]:

(18)

^{25}]. It solves the optimization in (10), but with and defined as

(19)

(20)

^{25}]. Song et al. [ ^{31}] propose to combine FDA and MMC with OLPP to obtain semi-supervised FDA (SSFDA) and semi-supervised MMC (SSMMC), respectively. SSFDA solves the optimization in (10), but modifies the within-class scatter as

(21)

(22)

^{32}] to combine MFA, DNE, and LFDA with OLPP, and [ ^{29}] to combine DONPP with ONPP.

^{38}], multilabel Gaussian random field (ML-GRF) [ ^{39}], and multilabel local and global consistency (ML-LGC) [ ^{39}]. Letting , , and denote the feature, label, and embedding matrices of the labeled samples and the embedding matrix for the unlabeled samples, this group of methods attempts to achieve the following: 1) best predict the label information from the embeddings , 2) preserve the feature-based adjacency between all samples, and 3) drive to preserve a certain similarity between classes. SMSE1 solves the following optimization problem:

(23)

(24)

(25)

^{38}], [ ^{39}].

^{63}] is one of the earliest such works, showing connections between PCA, LPP, and LDA regarding the calculation of the weight matrices used to build their Laplacians, and their unsupervised or supervised setup. In [ ^{64}], the commonality of LLE, Isomap, LE, PCA, KPCA, and MDS was examined in terms of the underlying eigenfunction estimation these methods perform.

^{23}] advocates the general form

(26)

^{65}] also compares different methods, such as the unsupervised LLE, LE, PCA, MDS, Isomap, LPP, ONPP, NPP, OLPP, the supervised LDA, and the supervised versions of NPP and LPP in terms of (26) and discusses their kernel formulations.

^{19}] based on the design proposed in [ ^{66}]. A patch corresponds to a measurement and its related neighbors. Letting denote an binary selection matrix that indicates point membership for the th patch and the number of patches, the framework performs global optimization according to

(27)

Table 1. Existing Dimensionality Reduction Algorithms Expressed in Terms of the Templates of (30), (31)

^{23}], [ ^{63}], and Kokiopoulou et al. [ ^{65}]. However, in some cases it is not possible to enforce the zero row-sum constraint for (or ) and view it as a Laplacian matrix. For example, the symmetric Laplacian [ ^{14}] and many multilabel SDR methods, such as HSL [ ^{40}], SLVM [ ^{37}], and SFEHS [ ^{46}], do not require . Regarding the patch alignment framework of [ ^{19}], it can represent many models, but it is not as intuitive as the trace optimization and can allow nonunique representations of the same model.

(28)

(29)

(30)

(31)

^{67}], [ ^{68}], [ ^{69}], using various copy, select, and ignore transformations. Here, however, sample duplication is implemented through an efficient mechanism and is employed only for the calculation of the weights, not for the classification stage.

(32)

(33)

Table 2. Proposed Dimensionality Reduction Algorithms Expressed in Terms of the Templates of (30), (31)

**G**.

**3.3.1 Label-Based Proximity Matrix** In order to evaluate the similarity between label vectors, we propose three different schemes to compute the proximity matrix by: 1) working in the binary label space , 2) working in a transformed real label space, and 3) utilizing class similarity information.

*Scheme 1*. Using a bit-string-based similarity to capture the proximity structure between label vectors is the most direct way to build . Examples include MDDM (see Table 1), one configuration of which employs an and-based similarity (i.e., the number of common sample labels) for centered samples, and HSL, which scales the and-based similarity by class membership (this gives a cosine norm on binary vectors) and weighs it with class sizes. In the following, we provide alternative ways to measure similarities:

Sørensen's similarity coefficient, also known as Dice's coefficient:

(34)

This is equivalent to scaling the and-based similarity with . We can also define a scaled version of Sørensen's coefficient, obtained by weighting with the class sizes , as

(35)

Jaccard similarity coefficient

(36)

This can be viewed as the number of common classes scaled by the inverse of the total number of different classes and belong to.

Hamming-based similarities

(37)

(38)

While the previous and-based indices rely on the number of “shared” classes under different scalings, a similarity index based on the Hamming distance relies on the number of “distinct” classes.
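The Scheme 1 indices above can be computed directly from the binary label matrix. The sketch below is our own: the mapping of the Hamming distance to a [0, 1] similarity is one illustrative normalization, and we assume every sample carries at least one label:

```python
import numpy as np

def label_similarities(Y):
    """Bit-string similarities between binary label vectors (rows of Y):
    Sorensen/Dice, Jaccard, and a Hamming-based index. Returns three
    (n, n) similarity matrices."""
    Y = np.asarray(Y, dtype=float)
    n, c = Y.shape
    inter = Y @ Y.T                      # and-based similarity |y_i AND y_j|
    sizes = Y.sum(axis=1)                # number of labels per sample
    dice = 2.0 * inter / np.maximum(sizes[:, None] + sizes[None, :], 1e-12)
    union = sizes[:, None] + sizes[None, :] - inter
    jaccard = inter / np.maximum(union, 1e-12)
    hamming = np.abs(Y[:, None, :] - Y[None, :, :]).sum(axis=2)
    ham_sim = 1.0 - hamming / c          # fraction of agreeing label bits
    return dice, jaccard, ham_sim
```

All three are symmetric with unit diagonal; they differ only in how the shared and distinct classes are weighted, as discussed above.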

*Scheme 2*. We can also seek the latent similarity between binary label vectors in a transformed and more compact real space. In the first stage, we map each -dimensional binary label vector to a -dimensional real space ( ) and obtain a set of transformed label vectors . Among the many ways of achieving this, one is to employ a projection technique that maximizes the variance of the projections as

(39)

This is actually PCA in the binary label space, mapping the -dimensional label vectors into a more compact number of uncorrelated directions. The mappings are generated by the top right singular vectors of . Other ways to generate the would be to use any of the UDR algorithms described in Section 2.1 or their corresponding kernel versions with as the input matrix.

In the second stage of Scheme 2, the following similarity measures can be employed to capture the label-based proximities between the samples :

Minkowski-based similarity

(40)

This is computed from the Minkowski distance passed through a Gaussian to map to a similarity value.

Tanimoto similarity coefficient

(41)

This is an extended cosine similarity (becoming the Jaccard coefficient when applied to binary vectors).

Existing feature-based similarity approaches: Any of the techniques from Section 2.1.1 used to compute the weights can also be employed here.
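The two stages of Scheme 2 can be sketched together. This is our own illustration (function name, parameter names, and the Gaussian mapping of the Minkowski distance are assumptions): the label vectors are projected via the top right singular vectors of the centered label matrix (PCA in label space), and similarities are then computed in the projected space:

```python
import numpy as np

def scheme2_similarity(Y, dim=2, p=2, sigma=1.0):
    """Scheme 2 sketch: PCA-style projection of binary label vectors to
    a compact real space, followed by a Minkowski-distance-based
    similarity passed through a Gaussian."""
    Yc = Y - Y.mean(axis=0)                       # center the label matrix
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    Z = Yc @ Vt[:dim].T                           # projected label vectors
    diff = np.abs(Z[:, None, :] - Z[None, :, :])
    dist = (diff ** p).sum(axis=2) ** (1.0 / p)   # Minkowski distance
    return np.exp(-dist**2 / (2.0 * sigma**2))    # map to a similarity
```

The Tanimoto coefficient or any of the feature-based measures from Section 2.1.1 could replace the last line without changing the first stage.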

*Scheme 3*. In this scheme, we consider the strong degree of dependence the classes may possess due to many samples belonging to multiple classes. A flexible way of taking this into account is to define the similarity between any two label vectors as

(42)

or equivalently in matrix form (see also Table 2). Equation (42) averages the similarities of the pairs of classes that either the th or the th sample belongs to. is the similarity matrix between classes and can be constructed via different mechanisms directly from the label matrix . The most natural is , where is the number of samples shared by the th and th classes. More sophisticated ways to compute can also be implemented. For example, by representing each class with an -dimensional binary vector , all similarity indices presented in Schemes 1 and 2 can be used to obtain , but with used as the input instead. For some domain-specific classification problems, one can also compute from a feature space in which each class is directly representable with a set of domain-specific features. For example, in document classification each document is represented using different words, with features being the number of times each word occurs in each document. The task is to assign the document to different categories, such as sports, politics, and science. In this case, one can also represent each category with subsets of words, and class features can be either the number of times the word occurs in the documents belonging to a specific category, given as , or binary indicators showing whether the word occurs in the documents belonging to that category. Subsequently, can be computed from these word features using the techniques from Section 2.1.1.
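One plausible instantiation of this scheme is sketched below; it is our own reading, with the co-occurrence-count class similarity and the averaging over class pairs implemented as division by the label counts (the paper's exact (42) may normalize differently):

```python
import numpy as np

def scheme3_similarity(Y):
    """Scheme 3 sketch: class-similarity matrix C from co-occurrence
    counts (C[k, l] = number of samples shared by classes k and l),
    then s_ij averages C[k, l] over all class pairs (k, l) with sample
    i in class k and sample j in class l, i.e. y_i C y_j' / (|y_i||y_j|).
    Assumes every sample has at least one label."""
    Y = np.asarray(Y, dtype=float)
    C = Y.T @ Y                              # class co-occurrence counts
    sizes = Y.sum(axis=1)                    # labels per sample
    return (Y @ C @ Y.T) / np.outer(sizes, sizes)
```

Because the averaging runs over dependent classes, two samples with no shared labels can still obtain a nonzero similarity whenever their classes co-occur elsewhere, which is precisely the class-dependence effect Scheme 3 is after.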

*Discussion.* The proximity matrices obtained from the three proposed schemes are formulated at the bottom of Table 2. Scheme 1 directly computes the similarities from the original label vectors using a string-based measure, while Scheme 2 computes them from a set of projected label vectors using a real vector-based similarity estimation. Within each scheme, the different measures will likely lead to with similar overall structures, but with different matrix element values reflecting the type and degree of “closeness” between the embedded points. It should be mentioned that when the problem at hand has a large number of classes, such as text categorization with large taxonomies [ ^{70}], the label matrix is usually very sparse due to the lack of training samples for some classes. In this case, Scheme 2 is preferred over Scheme 1, as the projected label vectors provide a more compact, simplified, and robust representation with reduced noise. Compared to Schemes 1 and 2, Scheme 3 is more domain oriented because it incorporates class dependence, and so it can potentially model the label-based proximity more precisely, which could be beneficial in applications where class memberships are not independent.

**3.3.2 A Priority-Based Combination Mechanism** Ideally, if the features can accurately describe all the discriminating characteristics, the proximity structures computed from the features and the labels should be very similar. However, when processing real data sets, certain characteristics, such as distributions that are not easily separable, or cases where patterns closely/distantly located in the feature space belong to different/same classes, may lead to incompatible proximities in the feature (Section 2.1.1) and label (Section 3.3.1) spaces, and thus to increased classification errors.

Given two proximity matrices and representing the two types of proximity information sources, most existing algorithms (e.g., SLVM and OLPP-R, discussed at the beginning of Section 3.3) assume that these two types are equally important, although they allow the user to control the degree of preference with a weight parameter. However, given a classification task, the label information of the training samples constitutes the more dominant, driving force of the problem. Also, the feature information is usually more susceptible to noise and to imperfections of the employed feature extraction or preprocessing stages. Therefore, to equip the proposed MOPE framework with a more robust information combining mechanism for calculating , we consider the feature-based proximity as a secondary source of information and the label-based proximity as the primary one. To implement this setup, we propose using the combining function:

(43)

where is the combined similarity between two data points, and the two information sources satisfy, or can be scaled to satisfy, . This function is monotonically increasing in and and is controlled by three parameters: to set the lower bound, and to control the rate of interaction of and . Equation (43) is designed so that the contribution of is primarily restricted by and on its own cannot increase too much unless is adequately high to support it. In this way, gives priority to over . The overall behavior of (43) is seen from

and also from the example of Fig. 1. The hyperparameters , , and can be set by the user or tuned by cross-validation to adjust the interaction of the proximity information to the problem at hand. When , the proximity matrix is computed only from the class information, as in CCA, PLS, MDDM, HSL, MMC, etc. Similar mechanisms for combining sources of information with different priorities have also been used in [ ^{71}] in a different context. More powerful combining functions could also be designed to generalize or extend (43), e.g., with , which supports more options besides the priority-based mechanism, such as on its own ( and ) and a weighted sum of and ( , , and ).
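Since (43) itself is not reproduced here, the following is an illustrative stand-in of our own design that exhibits the stated properties (monotone in both similarities, label similarity acting as the gatekeeper, a lower-bound parameter and an interaction rate), not the paper's exact function:

```python
import numpy as np

def combine(wL, wF, beta=0.5, gamma=2.0):
    """Priority-based combination of a label similarity wL (primary)
    and a feature similarity wF (secondary), both in [0, 1].
    Illustrative form: the result is 0 whenever wL is 0, so wF alone
    can never raise the combined similarity; `beta` bounds the label
    contribution from below, `gamma` controls the interaction rate."""
    wL = np.asarray(wL, dtype=float)
    wF = np.asarray(wF, dtype=float)
    return wL * (beta + (1.0 - beta) * wF**gamma)
```

With `beta = 1` the feature term vanishes and only the label proximity survives, mirroring the purely label-driven setting mentioned above.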

We calculate **W** as in Section 2.1.1 for all sample pairs and using a method from Schemes 1-3; then, the combined similarities are obtained as described above. Finally, we define using a -NN search as in Section 2.1.1, using as edge weights either constant values or the similarities.

(44)

(45)

^{32}], which can be considered a special case of the proposed one. This can be seen by using (31) as the objective, with defined as in (44), but with the constraint matrix obtained by setting in (45) to get .

^{11}], [ ^{23}], [ ^{64}]. We can apply the standard kernel trick [ ^{72}] to the formulation of (30), through the use of a kernel function , which defines the dot product in a high-dimensional (possibly infinite-dimensional) feature space known as the kernel-induced space [ ^{73}], [ ^{74}]. Letting denote the mapping that transforms the original data to the kernel-induced space , then is the feature matrix in . Further, if denotes the kernel matrix between the data points, then . Working in , we look for a transformation matrix to project the kernel-induced features to a -dimensional subspace, where the embeddings are computed by . The classical kernel trick expands each of the transformation vectors as a linear combination of all the training samples in , given as . Letting denote the coefficient matrix, we have . By incorporating this into (30), the optimal coefficients can be obtained by solving

(46)

^{75}]. It is also possible to select a collection of ( ) seed prototypes instead of using all the training samples, in order to compute relation features. In this case, the computational cost can be further reduced to ( ). Since previous research [ ^{76}], [ ^{77}] has shown that (dis)similarity values can be used as input features to build good classifiers, we can expect the discriminating ability of the embeddings computed from these relation features to be similar to that of the original features.
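The relation-feature idea can be sketched as follows; the Gaussian kernel is our own choice of similarity, and the function name is illustrative. Each sample is re-represented by its similarities to a small set of seed prototypes, so the subsequent eigenproblem scales with the number of prototypes rather than the original dimensionality:

```python
import numpy as np

def relation_features(X, prototypes, sigma=1.0):
    """Relation features sketch: represent each of the n samples in X
    by its Gaussian-kernel similarities to m seed prototypes, giving
    an (n, m) feature matrix with m typically much smaller than the
    original dimensionality."""
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))
```

Any linear projection method expressed by (31) can then be run on this (n, m) matrix instead of the original high-dimensional features.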

**Reuters documents.** The “Reuters-21578 Text Categorization Test Collection” contains articles taken from the Reuters newswire, ^{1} where each article is assigned to one or more semantic categories. A total of 9,980 articles from 10 overlapping categories were used in our experiments, with each category containing between 400 and 4,000 articles. We randomly divided the articles from each category into three partitions of nearly equal size, for training, validation, and testing. This leads to 3,328 articles for training and 3,326 articles each for validation and testing. Around 18 percent of these articles belong to two to four different categories at the same time, while each of the rest belongs to a single category.

**Education evidence portal (EEP) documents.** A collection of documents supplied by the EEP, ^{2} where each document is a fairly lengthy full paper or report (approximately 250 KB on average after conversion to plain text). Domain experts have developed a taxonomy of 107 concept categories in the area and manually assigned categories to the documents stored in the database. This manual effort has resulted in 2,149 documents, comprising 1,928 training documents and 221 test documents, most with multiple categories assigned. Among the 107 categories, the five largest contain more than 500 documents each, while most of the remaining ones contain between 1 and 300 documents. Around 96 percent of these documents are assigned 2 to 17 different categories, while each of the remaining belongs to only one category.

^{3} to the documents, then extracted word unigrams from each document. For the Reuters documents, after filtering out the low-frequency words, the term frequency-inverse document frequency (tf-idf) values of 24,012 word unigrams are used as the original features. This leads to a feature matrix for the training samples and a feature matrix for the query samples, in both the validation and test procedures. For the EEP documents, the corresponding frequencies of the word unigrams, representing the number of times the terms occur in the documents, are used as the original features. This leads to a feature matrix for the training samples and a feature matrix for the test samples.
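The tf-idf weighting mentioned above can be sketched minimally. The `log(N/df)` idf form below is one common variant, not necessarily the exact weighting used for the Reuters features, and the function name is ours:

```python
import numpy as np

def tfidf(counts):
    """Minimal tf-idf on a (documents x terms) count matrix:
    multiply each term count by log(N / df), where df is the number
    of documents containing the term (illustrative variant)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)                 # document frequency per term
    idf = np.log(n_docs / np.maximum(df, 1))      # guard against df = 0
    return counts * idf
```

Note that a term occurring in every document receives idf 0, which is exactly the down-weighting of uninformative high-frequency terms that motivates the scheme.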

^{4} The discriminating ability of the same number of embeddings computed from the same set of relation features was evaluated with the same classifier. Two sets of experiments were conducted:

**Experiment 1** was conducted for SDR. The proposed MESD extension was applied to three single-label SDR methods, including FDA, LFDA, and OLPP-R. The three schemes to compute the label-based proximity matrix of MOPE were also compared. The proposed combination mechanism in (43) was compared with the Hadamard multiplication used by LFDA and the weighted sum used by SLVM, with the same configuration of and . The proposed methods were also compared with three existing UDR methods (LSI, PCA, and OLPP) and six existing multilabel SDR methods (MDDM, CCA, the three versions of HSL, and MORP).

**Experiment 2** was conducted for SSDR. Only part of the training samples were used as labeled data, and only the resulting embeddings of the labeled data were used to train the classifier. We combined OLPP and the best-performing MOPE instance from Experiment 1 using the proposed SSDR framework, leading to semi-supervised MOPE (SSMOPE). The results were compared with those obtained by using OLPP or MOPE alone, as well as with the existing semi-supervised multilabel algorithm ML-LGC.

**4.3.1 Experiment 1** First, we compare the label-based proximity matrices for MOPE computed with the proposed schemes against the label-based matrices used by CCA and the two versions of HSL with the highest and lowest classification errors, as well as the feature-based proximity matrix used by OLPP. Figs. 2 and 3 present a visual qualitative comparison between these matrices and also include their corresponding numerical classification performances. To delineate the potential class structure of these proximity matrices, we have reordered the training samples so that samples with more similar label vectors are grouped together; to do this, we applied the data seriation algorithm coVAT [ ^{78}] to the label matrix . It can be seen that the feature-based proximity matrix of the EEP documents in Fig. 3a does not possess an obvious class structure and is incompatible with many of the label-based proximity matrices in Fig. 3. In contrast, the matrix of the Reuters documents in Fig. 2a possesses a roughly similar, though less distinct, class structure to the matrices in Fig. 2. This indicates that the EEP set includes noisier features than the Reuters set. It can also be seen from Figs. 2 and 3 that, compared to the label-based proximity matrices used by CCA and HSL, our proposed schemes generate proximity matrices with sharper class structures and comparatively higher classification performance, especially for the EEP set. Among the three proposed schemes, Scheme 3 works best for both data sets, and it is therefore used to compute the combined proximity matrix for MOPE.

We also compare the performance of various existing and proposed SDR and UDR methods in Fig. 4, and that of MOPE equipped with the proposed combination mechanism as well as two existing ones (Hadamard and weighted sum; see Table 3). It can be seen that the embeddings obtained by SDR possess better discriminating ability than those obtained by UDR, although the existing SDR method MDDM does not perform well for either data set. All the multilabel SDR algorithms derived from the MESD and MOPE frameworks provide performance similar to or better than the existing methods. The top three best-performing algorithms for both data sets are all generated by MOPE and MESD (see Fig. 4). The proposed combination mechanism for MOPE performs better than the Hadamard and weighted sum ones, which sometimes even reduce the performance obtained by using or alone, as seen in Table 3. The performance improvement achieved by using advanced SDR methods (e.g., HSL, MESD, and MOPE) is higher for the EEP data set (3 to 7 percent) than for Reuters (0.2 to 1.2 percent), as EEP appears to suffer more from noisy features.

Table 3. Comparison of Different Combination Schemes Using the Score

We also demonstrate the reduction in computational cost obtained by using the relation features instead of the original ones. With methods expressed by (31) based on linear projections, one needs to decompose or invert a matrix, where the original features are for the EEP and for the Reuters set. This makes it highly impractical to perform the classification task. By using the relation features, the computational cost depends only on the number of training samples . This corresponds to less than 300 secs for both data sets, and no more than 2,500 secs for the comparatively larger Reuters set when generalized eigen-decomposition is involved.

**4.3.2 Experiment 2** The SSDR performance for each data set, averaged over all classes, is shown in Table 4, where different percentages of the training samples are used as labeled data (40 or 80 percent). It can be seen that the proposed SSDR framework of Section 3.4 (the instance chosen here, referred to as SSMOPE, combines the UDR method OLPP and an SDR instance derived from the MOPE framework of Section 3.3) outperforms the existing ML-LGC. As the 40 percent rows of the table show, when there is not enough labeled data available, the embeddings computed from both labeled and unlabeled data can possess better discriminating ability than those computed from labeled data only, as in MOPE, or by ignoring the label information altogether, as in OLPP.

Table 4. SSDR Performance in Score for Both Data Sets

# Acknowledgments

*T. Mu and S. Ananiadou are with the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, MIB Building, 131 Princess Street, Manchester M1 7DN, United Kingdom.*

*E-mail: tingtingmu@me.com, sophia.ananiadou@manchester.ac.uk.*

*J.Y. Goulermas is with the Department of Electrical Engineering and Electronics, University of Liverpool, Brownlow Hill, Liverpool L69 3GJ, United Kingdom. E-mail: j.y.goulermas@liverpool.ac.uk.*

*J. Tsujii is with Microsoft Research Asia, Beijing, China.*

*E-mail: jtsujii@microsoft.com.*

*Manuscript received 18 Oct. 2010; revised 6 Aug. 2011; accepted 25 Dec. 2011; published online 9 Jan. 2012.*

*Recommended for acceptance by K. Murphy.*

*For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMI-2010-10-0799.*

*Digital Object Identifier no. 10.1109/TPAMI.2012.20.*

1. http://archive.ics.uci.edu/ml/support/Reuters-21578+Text+Categorization+Collection.

2. http://www.eep.ac.uk.

3. http://tartarus.org/~martin/PorterStemmer/.

4. Our code can be downloaded from http://pcwww.liv.ac.uk/~goulerma/software/mesd-mope.zip.

#### References

**Tingting Mu** received the BEng degree in electronic engineering and information science from the University of Science and Technology of China, Hefei, China, in 2004, and the PhD degree in electrical engineering and electronics from the University of Liverpool, Liverpool, United Kingdom, in 2008. She is currently a postdoctoral researcher with the School of Computing, Informatics and Media, University of Bradford, United Kingdom. Her current research interests include machine learning and pattern recognition, with applications to text mining, bioinformatics, biomedical engineering, and intelligent transportation systems. She is a member of the IEEE.

**John Yannis Goulermas** received the BSc degree (first class) in computation from the University of Manchester Institute of Science and Technology (UMIST), Manchester, United Kingdom, in 1994, and the MSc degree by research and the PhD degree from the Control Systems Center, Department of Electrical and Electronic Engineering, UMIST, in 1996 and 2000, respectively. He is currently a reader in the Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool, United Kingdom. His current research interests include machine learning, mathematical modeling/optimization, and image processing, with application areas including biomedical engineering and industrial monitoring and control. He is a senior member of the IEEE.

**Jun'ichi Tsujii** is a principal researcher with Microsoft Research Asia, China. Until March 2011, he was a professor of text mining in the School of Computer Science, University of Manchester, and a professor of computer science at the University of Tokyo. He has worked since 1973 in NLP, question answering, text mining, and machine translation. His recent research achievements include deep semantic parsing based on the feature forest model, efficient search algorithms for statistical parsing, improvement of the estimator for the maximum entropy model, and construction of the gold standard corpus (GENIA) for biomedical text mining. He has authored more than 300 papers in journals, conferences, and books. He was the president of the Association for Computational Linguistics (ACL) in 2006 and has been a permanent member of the International Committee on Computational Linguistics (ICCL) since 1992. He received the IBM Science Award (1988), the IBM Faculty Award (2005), and the Medal of Honor with Purple Ribbon from the Government of Japan in 2010.

**Sophia Ananiadou** is the director of the National Centre for Text Mining (NaCTeM) and a professor of text mining in the School of Computer Science, University of Manchester. She is the main designer of the text-mining tools and services currently used in NaCTeM, i.e., terminology management, information extraction, intelligent searching, and association mining. Her research projects include text mining-based visualization of biochemical networks, data integration using text mining, building biolexica and bio-ontologies for gene regulation, and automatic event extraction of bioprocesses, as well as the EC FP7 project MetaNet4U and research industrially funded by Pfizer/AstraZeneca/IBM/BBC. She has been awarded the Daiwa Adrian prize (2004) and the IBM UIMA innovation award (2006, 2007, 2008) for her leading work on text-mining tools in biomedicine. She has more than 160 publications in journals, conferences, and books.
