2.1.1 Feature-Based Weight Matrix In general, the weight matrix used by the above algorithms is controlled by an adjacency graph. Each weight is nonzero only for adjacent nodes in the graph. There are two principal ways to define this adjacency:
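The Gaussian kernel, giving $w_{ij} = \exp\left(-\|x_i - x_j\|^2 / \tau\right)$ for a width parameter $\tau > 0$.
Local-scaling Gaussian kernel [54], with $w_{ij} = \exp\left(-\|x_i - x_j\|^2 / (\sigma_i \sigma_j)\right)$, where $\sigma_i$ is a local scale such as the distance from $x_i$ to its $k$th nearest neighbor.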
An alternative weight definition [ 27], based on
The optimal weight reconstruction matrix of LLE [ 10], as mentioned earlier.
Any user-defined domain-dependent set of similarities between the samples [ 55].
Any of the classical cosine norm, Pearson's, Spearman's or Kendall's correlation coefficients, etc.
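The weight definitions listed above can be illustrated with a minimal NumPy sketch. It assumes samples are stored as rows, a symmetric $k$-NN adjacency graph, and the standard forms of the Gaussian and local-scaling kernels; the function names and the parameters `tau` and `k` are illustrative and not taken from the original text.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between the rows of X (n x d)."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.maximum(D2, 0.0)  # clip tiny negatives caused by round-off

def knn_adjacency(D2, k):
    """Symmetric k-NN adjacency: i ~ j if either is among the other's k nearest."""
    n = D2.shape[0]
    D = D2.copy()
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbour
    idx = np.argsort(D, axis=1)[:, :k]
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), idx.ravel()] = True
    return A | A.T

def gaussian_weights(X, k, tau=1.0, local_scaling=False):
    """Weights that are nonzero only for adjacent nodes of the k-NN graph."""
    D2 = pairwise_sq_dists(X)
    A = knn_adjacency(D2, k)
    if local_scaling:
        # Assumed local scale: distance to the k-th nearest neighbour
        # (assumes no duplicate points, so that every scale is positive).
        D = np.sqrt(D2)
        np.fill_diagonal(D, np.inf)
        sigma = np.sort(D, axis=1)[:, k - 1]
        K = np.exp(-D2 / (sigma[:, None] * sigma[None, :]))
    else:
        K = np.exp(-D2 / tau)
    return np.where(A, K, 0.0)
```

Calling `gaussian_weights(X, k=5)` returns a symmetric matrix with zeros for non-adjacent pairs; any other similarity from the list could be substituted for `K` without changing the adjacency step.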
2.2.1 Single-Label Classification Fisher discriminant analysis (FDA) [ 20] is the most commonly used linear SDR method for single-label classification, employing the projection . It computes the optimal projection matrix as
where $S_b$ and $S_w$ are the between-class and within-class scatter matrices. FDA motivates a group of other related SDR techniques that pursue embeddings or projections by employing different ways of setting up the between- and within-class matrices. For instance, by incorporating the local structure of the data into the Fisher criterion, local FDA (LFDA) combines the ideas of LPP and FDA [24]. LFDA solves the same optimization problem as in (10), but with $S_b$ and $S_w$ redefined as in [24] using local information. Marginal Fisher analysis (MFA) is another variant of FDA that also considers the local data structure [23], and solves (10), with $S_b$ and $S_w$ based on intraclass $k_1$-NNs and interclass $k_2$-nearest pairs ($k_2$-NPs).
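As a rough illustration of the FDA step described above, the following sketch builds the between-class and within-class scatter matrices and takes the leading eigenvectors of the resulting generalized eigenproblem. It assumes samples in rows and integer class labels, and it adds a small ridge term `reg` to the within-class scatter purely for numerical stability; that ridge and the function names are assumptions of this sketch, not part of the original formulation.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Between-class (Sb) and within-class (Sw) scatter of the rows of X."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sb += Xc.shape[0] * np.outer(mc - mean, mc - mean)
        Sw += (Xc - mc).T @ (Xc - mc)
    return Sb, Sw

def fda_projection(X, labels, k, reg=1e-6):
    """Top-k directions of the eigenproblem Sb p = lambda (Sw + reg I) p."""
    Sb, Sw = scatter_matrices(X, labels)
    d = X.shape[1]
    M = np.linalg.solve(Sw + reg * np.eye(d), Sb)
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(-evals.real)[:k]
    return evecs[:, order].real  # columns are projection directions
```

The embeddings are then `X @ P` for the returned matrix `P`. Variants such as LFDA and MFA would keep the eigen-solver and change only how the two scatter matrices are built.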
With the same goal as FDA, the maximum margin criterion (MMC) is proposed in the form of $\mathrm{tr}(S_b - S_w)$, instead of $\mathrm{tr}(S_w^{-1} S_b)$ used by the Fisher criterion. The MMC-based supervised embeddings can be obtained by solving the following optimization problem [21]:
A regularized version of MMC in the form of , is sometimes preferred for implementation with [ 31]. Discriminative locality alignment (DLA) is a recently proposed SDR method, defined by Zhang et al. [ 19]:
where the data points denote the intraclass -NNs of , and denote the interclass -NNs of . By viewing the first term of (12) as a modified within-class scatter constructed from intraclass neighbors and the second term as a modified between-class scatter constructed from interclass neighbors, DLA minimizes a criterion in the form of . This can be viewed as a modified MMC using the local structure of data.
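The MMC idea discussed above, a difference of scatter matrices rather than a ratio, can be sketched as an ordinary symmetric eigenproblem. The sketch assumes the usual orthonormality constraint on the projection matrix, samples in rows, and integer class labels; the function name is illustrative.

```python
import numpy as np

def mmc_projection(X, labels, k):
    """Top-k eigenvectors of Sb - Sw (orthonormal columns)."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
        Sw += (Xc - mc).T @ (Xc - mc)
    # Sb - Sw is symmetric, so eigh applies and no matrix inverse is needed.
    evals, evecs = np.linalg.eigh(Sb - Sw)
    order = np.argsort(evals)[::-1][:k]
    return evecs[:, order], evals[order]
```

Because no inverse of the within-class scatter is required, this form remains well defined when that matrix is singular. A DLA-style variant would keep the same solver and build the two terms from intraclass and interclass neighbors only.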
Another group of SDR techniques has been developed by incorporating the label information into OLPP. Discriminant neighborhood embedding (DNE) replaces the weight matrix of OLPP with the following [ 26]:
This minimizes the distances between intraclass neighbors while maximizing the distances between interclass neighbors, as DLA does. Repulsion OLPP (OLPP-R) is an SDR method which replaces the Laplacian in OLPP with a linear combination of a repulsion Laplacian and a class Laplacian [27]. The class and repulsion weight matrices have nonzero elements corresponding to pairs of points from the same class and nearby pairs from different classes, respectively. The final Laplacian of OLPP-R is this linear combination, weighted by a user-defined parameter.
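A signed neighborhood weight matrix of the kind DNE uses can be sketched as follows. The sketch assumes the commonly cited DNE definition (+1 for $k$-NN pairs from the same class, -1 for $k$-NN pairs from different classes, 0 otherwise) and a symmetric neighborhood relation; since the defining equation is not reproduced here, treat these choices as assumptions.

```python
import numpy as np

def dne_weights(X, labels, k):
    """Signed k-NN weights: +1 intraclass neighbours, -1 interclass neighbours."""
    labels = np.asarray(labels)
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D2, np.inf)  # exclude self-neighbours
    idx = np.argsort(D2, axis=1)[:, :k]
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), idx.ravel()] = True
    A = A | A.T  # i ~ j if either point is a k-NN of the other
    same = labels[:, None] == labels[None, :]
    return np.where(A & same, 1.0, 0.0) - np.where(A & ~same, 1.0, 0.0)
```

Positive entries pull intraclass neighbors together in the embedding, while negative entries push interclass neighbors apart.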
The unsupervised ONPP method has also been adapted for SDR. For example, the supervised ONPP (SONPP) [ 11] uses LLE-style reconstruction weights calculated for all samples from the same class. Discriminative ONPP (DONPP) [ 29] uses slightly different reconstruction weights from SONPP, calculated by assuming each sample is reconstructed from the remaining samples from its class, to minimize the embedding distances, and, at the same time, it maximizes the distances of the embeddings corresponding to nearby samples from different classes. The repulsion ONPP (ONPP-R) [ 27] uses a similar weight matrix to SONPP, but incorporates a repulsion Laplacian as OLPP-R.
2.2.2 Multilabel Classification Given a multilabel classification data set, most SDR algorithms model the class (target) information as an output matrix $Y$ [37], [42], [43], [44], of which each element $y_{ij} = 1$ if the $i$th sample belongs to the $j$th class, and $y_{ij} = 0$ (or sometimes $-1$) otherwise.
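The output matrix described above can be built from per-sample label sets as in the following sketch. It assumes classes are indexed from 0 and that samples are stored in rows; the helper name and the `negative` argument (0 or -1 for non-membership) are illustrative.

```python
import numpy as np

def label_matrix(label_sets, n_classes, negative=0):
    """Binary output matrix: entry (i, j) is 1 if sample i belongs to class j."""
    Y = np.full((len(label_sets), n_classes), negative, dtype=int)
    for i, classes in enumerate(label_sets):
        Y[i, list(classes)] = 1
    return Y

# Three samples, three classes; the second sample has a single label.
Y = label_matrix([{0, 2}, {1}, {0, 1, 2}], n_classes=3)
```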
Classical SDR algorithms for multilabel classification compute embeddings by optimizing a statistical criterion between the embeddings and multiple labels. Canonical correlation analysis (CCA), for example, maximizes the correlation coefficient between the projected features and the labels [ 41], [ 43] as
where and are the centered feature and label matrices. The regularized CCA (rCCA) modifies the constraint of the original CCA to to prevent overfitting and avoid the singularity of , where is the user-defined regularization parameter [ 43], [ 57]. Partial least squares (PLS) maximizes the covariance between the projected features and the labels [ 42], [ 58]:
The orthonormalized PLS (OPLS) changes the constraint of (15) to [ 42], [ 59]. Both the multilabel dimensionality reduction via dependence maximization (MDDM) [ 44] and supervised feature extraction using Hilbert-Schmidt norm (SFEHS) [ 46] compute the optimal embeddings by maximizing the Hilbert-Schmidt independence criterion (HSIC) between the projected features and the labels. Using the HSIC defined in [ 60], MDDM and SFEHS solve the following optimization problem:
where can be either or the kernel matrix computed from the label vectors . Using the HSIC defined in [ 61], SFEHS solves a similar optimization problem to (16), but with replaced by a different symmetric matrix (see [ 46]).
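The HSIC-based criterion used by MDDM can be sketched with a linear kernel on the projected features. The sketch assumes the common empirical estimate $\mathrm{tr}(KHLH)/(n-1)^2$ with the centering matrix $H$, a label kernel $L = YY^T$, samples in rows, and an orthonormality constraint on the projection; these choices and the function names are assumptions of the sketch rather than a transcription of (16).

```python
import numpy as np

def centering_matrix(n):
    """H = I - (1/n) 1 1^T."""
    return np.eye(n) - np.ones((n, n)) / n

def hsic_linear(Z, L):
    """Empirical HSIC between embeddings Z (linear kernel) and label kernel L."""
    n = Z.shape[0]
    H = centering_matrix(n)
    return np.trace(Z @ Z.T @ H @ L @ H) / (n - 1) ** 2

def mddm_projection(X, Y, k):
    """Orthonormal P maximizing tr(P^T X^T H L H X P), with L = Y Y^T."""
    n = X.shape[0]
    H = centering_matrix(n)
    L = Y @ Y.T
    M = X.T @ H @ L @ H @ X
    evals, evecs = np.linalg.eigh((M + M.T) / 2.0)  # symmetrize against round-off
    return evecs[:, np.argsort(evals)[::-1][:k]]
```

Replacing `L` with another kernel matrix computed from the label vectors changes the dependence measure without changing the solver.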
The hypergraph spectral learning (HSL) method [40] achieves multilabel SDR by modifying LPP: it replaces the Laplacian matrix of LPP with a hypergraph-based Laplacian matrix, defined through a similarity matrix computed from a hypergraph that models the multilabel class information of the training data set. Three different ways are provided in [40] to compute this similarity matrix, including clique expansion, star expansion, and the hypergraph Laplacian [62], corresponding to HSL1, HSL2, and HSL3, respectively.
The supervised latent variable model (SLVM) is another SDR algorithm for multilabel classification, aiming at the optimal reconstruction of both feature and label matrices in the embedded space. It is based on the optimization [36], [37]
where is a degree parameter determining how much the embeddings should be biased by the label information. By centering both feature and label matrices, applying the projection technique, and introducing the Tikhonov regularization, the multi-output regularized feature projection (MORP) is also proposed in [ 37]:
3.3.1 Label-Based Proximity Matrix In order to evaluate the similarity between label vectors, we propose three different schemes to compute the proximity matrix: 1) working in the binary label space, 2) working in a transformed real label space, and 3) utilizing class similarity information.
Scheme 1. Using a bit-string-based similarity to capture the proximity structure between label vectors is the most direct way to build the proximity matrix. Examples include MDDM (see Table 1), of which one configuration employs an and-based similarity (i.e., the number of common sample labels) for centered samples, and HSL, which scales the and-based similarity by class membership (this gives a cosine norm on binary vectors) and weights it with class sizes. In the following, we provide alternative ways to measure similarities:
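Sørensen's similarity coefficient, also known as Dice's coefficient: $s_{ij} = \dfrac{2\, y_i^T y_j}{\|y_i\|_1 + \|y_j\|_1}$, where $y_i^T y_j$ counts the labels shared by the binary label vectors $y_i$ and $y_j$.
Jaccard similarity coefficient: $s_{ij} = \dfrac{y_i^T y_j}{\|y_i\|_1 + \|y_j\|_1 - y_i^T y_j}$.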
While the previous and-based indices rely on the number of “shared” classes with different scalings, a similarity index based on the Hamming distance relies on the number of “distinct” classes.
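The Scheme 1 indices can be computed for all pairs of binary label vectors at once, as in the sketch below. The and-based, Dice, and Jaccard forms follow their standard definitions; the Hamming-based similarity is taken here as one minus the normalized Hamming distance, which is an assumed form since the exact expression is not reproduced in the text.

```python
import numpy as np

def binary_label_similarities(Y):
    """Pairwise similarities between the binary rows of Y (n samples x c classes)."""
    Y = np.asarray(Y, dtype=float)
    c = Y.shape[1]
    common = Y @ Y.T                          # and-based: number of shared labels
    sizes = Y.sum(axis=1)
    total = sizes[:, None] + sizes[None, :]   # |y_i| + |y_j|
    union = total - common
    with np.errstate(divide="ignore", invalid="ignore"):
        dice = np.where(total > 0, 2.0 * common / total, 0.0)
        jaccard = np.where(union > 0, common / union, 0.0)
    hamming = total - 2.0 * common            # number of distinct labels
    return {
        "and": common,
        "dice": dice,
        "jaccard": jaccard,
        "hamming": 1.0 - hamming / c,         # assumed normalization
    }
```

Pairs of all-zero label vectors are given zero Dice and Jaccard similarity by convention in this sketch.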
Scheme 2. We can also seek the latent similarity between binary label vectors in a transformed and more compact real space. In the first stage, we map each -dimensional binary label vector to a -dimensional real space ( ) and obtain a set of transformed label vectors . Of the many ways for achieving this, one is to employ a projection technique that maximizes the variance of the projections as
This is actually PCA in the binary label space, mapping the -dimensional label vectors into a more compact number of uncorrelated directions. The mappings are generated by the top right singular vectors of . Other ways to generate the would be to use any of the UDR algorithms described in Section 2.1 or their corresponding kernel versions with as the input matrix.
This is computed from the Minkowski distance passed through a Gaussian to map to a similarity value.
Tanimoto similarity coefficient
This is an extended cosine similarity (becoming the Jaccard coefficient when applied to binary vectors).
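The two stages of Scheme 2 can be sketched as follows: a PCA-style mapping of the label vectors via the top right singular vectors, followed by a real-valued similarity. The sketch assumes the label matrix is centered before the SVD and that the Gaussian similarity takes the form $\exp(-d^2/\tau)$ for a Minkowski distance $d$ of order $p$; both are assumptions, as are the function names.

```python
import numpy as np

def pca_label_embedding(Y, k):
    """Project centered label vectors onto the top-k right singular vectors."""
    Yc = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:k].T

def gaussian_minkowski_similarity(T, p=2.0, tau=1.0):
    """Minkowski distance of order p passed through a Gaussian (assumed form)."""
    diff = np.abs(T[:, None, :] - T[None, :, :])
    dist = np.sum(diff ** p, axis=2) ** (1.0 / p)
    return np.exp(-dist ** 2 / tau)

def tanimoto_similarity(T):
    """t_i . t_j / (|t_i|^2 + |t_j|^2 - t_i . t_j)."""
    G = T @ T.T
    sq = np.diag(G)
    denom = sq[:, None] + sq[None, :] - G
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, G / denom, 0.0)
```

Applied directly to binary vectors, `tanimoto_similarity` reproduces the Jaccard coefficient, which matches the remark above.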
Scheme 3. In this scheme, we consider the strong degree of dependence the classes may possess due to many samples belonging to multiple classes. A flexible way of taking this into account is to define the similarity between any two label vectors as
or equivalently in matrix form (see also Table 2). Equation (42) averages the similarities of the pairs of classes to which either the $i$th or the $j$th sample belongs. The similarity matrix between classes can be constructed via different mechanisms directly from the label matrix. The most natural choice sets each element to the number of samples shared by the corresponding pair of classes. Other, more sophisticated ways to compute the class similarity matrix can also be implemented. For example, by representing each class with a binary vector indicating which training samples belong to it, all similarity indices presented in Schemes 1 and 2 can be used to obtain the class similarities, with these class vectors used as the input instead of the sample label vectors. For some domain-specific classification problems, one can also compute the class similarities from a feature space where each class is directly representable with a set of domain-specific features. For example, in document classification each document is represented using different words, with features being the number of times each word occurs in each document. The task is to assign each document to different categories, such as sports, politics, and science. In this case, one can also represent each category with subsets of words, and class features can be either the number of times a word occurs in the documents belonging to a specific category, or binary indicators showing whether the word occurs in the documents belonging to that category. Subsequently, the class similarity matrix can be computed from these word features using the techniques from Section 2.1.1.
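One possible reading of Scheme 3 is sketched below: class similarities are taken as shared-sample counts, and the similarity between two label vectors is the average class similarity over all pairs formed by a class of the first sample and a class of the second. This averaging convention is an assumption, since Equation (42) is not reproduced here, and the function names are illustrative.

```python
import numpy as np

def class_similarity(Y):
    """Entry (k, l) is the number of samples shared by classes k and l."""
    Y = np.asarray(Y, dtype=float)
    return Y.T @ Y

def label_similarity_from_classes(Y, C):
    """Average class similarity over the class pairs of samples i and j (assumed form)."""
    Y = np.asarray(Y, dtype=float)
    counts = Y.sum(axis=1)              # number of classes per sample
    norm = np.outer(counts, counts)     # number of class pairs for (i, j)
    raw = Y @ C @ Y.T                   # sum of class similarities over those pairs
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(norm > 0, raw / norm, 0.0)
```

Any other class similarity matrix `C`, for instance one computed from domain-specific class features, can be passed to `label_similarity_from_classes` unchanged.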
Discussion. The proximity matrices obtained from the three proposed schemes are formulated at the bottom of Table 2. Scheme 1 directly computes the similarities from the original label vectors using a string-based measure, while Scheme 2 computes them from a set of projected label vectors using a real vector-based similarity estimation. Within each scheme, different measures would probably lead to proximity matrices with similar overall structures, but with different matrix element values reflecting the type and degree of “closeness” between the embedded points. It should be mentioned that when the problem at hand has a large number of classes, such as text categorization with large taxonomies [70], the label matrix is usually very sparse due to the lack of training samples for some classes. In this case, Scheme 2 is preferred over Scheme 1, as the projected label vectors provide a more compact, simplified, and robust representation with reduced noise. Compared to Schemes 1 and 2, Scheme 3 is more domain oriented because it incorporates class dependence, and so it potentially models the label-based proximity in a more precise manner, which could be beneficial in applications where class memberships are not independent.
3.3.2 A Priority-Based Combination Mechanism Ideally, if the features can accurately describe all the discriminating characteristics, the proximity structures computed from features and labels should be very similar. However, when processing real data sets, certain characteristics, such as distributions that are not easily separable or cases where patterns located closely/distantly in the feature space are from different/same classes, may lead to incompatible proximities in the feature (Section 2.1.1) and label (Section 3.3.1) spaces. This may lead to increased classification errors.
Given two proximity matrices representing the two types of proximity information sources, most of the existing algorithms (e.g., SLVM and OLPP-R, discussed in the beginning of Section 3.3) assume that these two types are equally important, although they allow the user to control the degree of preference with a weight parameter. However, given a classification task, the label information of the training samples constitutes a more dominant and driving force of the problem. Also, the feature information is usually more susceptible to noise and to imperfections of the employed feature extraction or preprocessing stages. Therefore, to equip the proposed MOPE framework with a more robust information-combining mechanism for calculating the combined proximity matrix, we treat the feature-based proximity as a secondary source of information and the label-based proximity as the primary one. To implement this setup, we propose using the combining function:
where is the combined similarity between two data points, and the two information sources do or can be scaled to satisfy . This function is a monotonically increasing function of and and is controlled by three parameters: to set the lower bound, and to control the rate of interaction of and . Equation (43) is designed so that the contribution of is primarily restricted by and on its own cannot increase too much unless is adequately high to support it. In this way, gives priority to over . The overall behavior of (43) is seen from
and also from the example of Fig. 1. The hyperparameters , , and can be set by the user or tuned by cross-validation procedures to adjust the interaction of the proximity information to fit the problem at hand. When , the proximity matrix is only computed from the class information, as in CCA, PLS, MDDM, HSL, MMC, etc. Similar mechanisms for combining sources of information with different priorities have also been used in [ 71] in different context. More powerful combining functions could also be designed to generalize or extend (43), e.g., with , which supports more options besides the priority-based mechanism, such as on its own ( and ) and weighted sum of and ( , , and ).
We calculate the feature-based similarities as in Section 2.1.1 for all sample pairs and the label-based similarities using a method from Schemes 1-3; then, the combined similarities are obtained as described above. Finally, we define the adjacency graph using a $k$-NN search as in Section 2.1.1, using as edge weights either constant values or the combined similarities.
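The final graph-construction step can be sketched as follows. Because the priority-based combining function (43) is not reproduced here, the sketch substitutes the two simpler combiners that the experiments use as baselines (Hadamard product and weighted sum); it is therefore a stand-in for the overall pipeline, not an implementation of (43). The function names, `alpha`, and the symmetric $k$-NN rule are assumptions.

```python
import numpy as np

def combine(WF, WL, mode="hadamard", alpha=0.5):
    """Baseline combiners for feature-based (WF) and label-based (WL) similarities."""
    if mode == "hadamard":
        return WF * WL
    if mode == "weighted_sum":
        return alpha * WF + (1.0 - alpha) * WL
    raise ValueError(f"unknown mode: {mode}")

def knn_graph_from_similarity(S, k, binary=False):
    """Keep, for each sample, its k most similar samples; symmetrize the graph."""
    n = S.shape[0]
    M = S.astype(float).copy()
    np.fill_diagonal(M, -np.inf)              # never select a sample as its own neighbour
    idx = np.argsort(-M, axis=1)[:, :k]
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), idx.ravel()] = True
    A = A | A.T
    # Edge weights: constant values or the combined similarities themselves.
    return np.where(A, 1.0 if binary else S, 0.0)
```

A priority-based combiner would replace only `combine`; the $k$-NN sparsification and the choice between constant and similarity-valued edge weights stay the same.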
Reuters documents. The “Reuters-21578 Text Categorization Test Collection” contains articles taken from the Reuters newswire (footnote 1), where each article is assigned to one or more semantic categories. A total of 9,980 articles from 10 overlapping categories were used in our experiments, with each category having between 400 and 4,000 articles. We randomly divided the articles from each category into three partitions of nearly the same size, for the purposes of training, validation, and test. This leads to 3,328 articles for training and 3,326 articles each for validation and test. Around 18 percent of these articles belong to two to four different categories at the same time, while each of the rest belongs to a single category.
Education evidence portal (EEP) documents. This is a collection of documents supplied by EEP (footnote 2), where each document is a quite lengthy full paper or report (approximately 250 KB on average after converting to plain text). Domain experts have developed a taxonomy of 107 concept categories in the area and manually assigned categories to documents stored in the database. This manual effort has resulted in 2,149 documents, including 1,928 training documents and 221 test documents, mostly with multiple categories assigned. Among these 107 categories, the five biggest ones contain more than 500 documents each, while most of the remaining ones contain between 1 and 300 documents. Around 96 percent of these documents are assigned 2 to 17 different categories, while each of the remaining ones is assigned to only one category.
Experiment 1 was conducted for SDR. The proposed MESD extension was applied to three single-label SDR methods, including FDA, LFDA, and OLPP-R. The three schemes to compute the label-based proximity matrix of MOPE were also compared. The proposed combination mechanism in (43) was compared with the Hadamard multiplication used by LFDA and the weighted sum used by SLVM, with the same configuration of and . The proposed methods were also compared with three existing UDR methods (LSI, PCA, and OLPP) and six existing multilabel SDR methods (MDDM, CCA, the three versions of HSL, and MORP).
Experiment 2 was conducted for SSDR. Only part of the training samples were used as labeled data and only the resulting embeddings of the labeled data were used to train the classifier. We combined OLPP and the best performing MOPE instance from experiment 1, using the proposed SSDR framework, leading to semi-supervised MOPE (SSMOPE). The results were compared with those obtained by solely using OLPP or MOPE, as well as the existing semi-supervised multilabel algorithm ML-LGC.
4.3.1 Experiment 1 First, we compare the label-based proximity matrices for MOPE computed with the proposed schemes, with the label-based matrices used by CCA, the two versions of HSL with the highest and lowest classification errors, as well as the feature-based proximity matrix used by OLPP. Figs. 2 and 3 present a visual qualitative comparison between these matrices and also include their corresponding numerical classification performances. To delineate the potential class structure of these proximity matrices, we have reordered the training samples in a way that samples with more similar label vectors are grouped together; to do that we applied the data seriation algorithm coVAT [ 78] on the label matrix . It can be seen that the feature-based proximity matrix of the EEP documents in Fig. 3a does not possess obvious class structure and is incompatible to many of the label-based proximity matrices in Fig. 3. On the contrary, the matrix of Reuters documents in Fig. 2a possesses roughly similar class structure to those matrices in Fig. 2, though less distinct. This indicates that the EEP set includes more noisy features than the Reuters set. It can be seen from Figs. 2 and 3 that compared to the label-based proximity matrices used by CCA and HSL, our proposed schemes generate proximity matrices with sharper class structures and possess comparatively higher classification performance, especially for the EEP set. Among the three proposed schemes, Scheme 3 works the best for both data sets, and therefore it is used to compute the combined proximity matrix for MOPE.
We also compare the performance of various existing and proposed SDR and UDR methods in Fig. 4, as well as MOPE equipped with the proposed combination mechanism and with two existing ones (Hadamard and weighted sum; see Table 3). It can be seen that the embeddings obtained by SDR possess better discriminating ability than those obtained by UDR, although the existing SDR method MDDM does not perform well for either data set. All the multilabel SDR algorithms derived from the MESD and MOPE frameworks provide performance similar to or better than existing methods. The top three best-performing algorithms are all generated by MOPE and MESD for both data sets (see Fig. 4). The proposed combination mechanism for MOPE performs better than the Hadamard and weighted sum ones, which sometimes even reduce the performance obtained by using the feature-based or the label-based proximity matrix alone, as seen in Table 3. The performance improvement achieved by using advanced SDR methods (e.g., HSL, MESD, and MOPE) is higher for the EEP data set (3 to 7 percent) than for Reuters (0.2 to 1.2 percent), as EEP seems to suffer from noisy features.
We also demonstrate the reduction of computational cost achieved by using the relation features instead of the original ones. With methods expressed by (31) based on linear projections, one needs to decompose or compute the inverse of a matrix whose size grows with the number of original features of the EEP and Reuters sets. This makes it highly impractical to perform the classification task. By using the relation features, the computational cost is only related to the number of training samples. This corresponds to less than 300 seconds for both data sets, and no more than 2,500 seconds for the comparatively larger Reuters set when generalized eigen-decomposition is involved.
4.3.2 Experiment 2 The SSDR performance for each data set, averaged over all classes, is shown in Table 4, where different percentages of the training samples are used as the labeled data (40 or 80 percent). It can be seen that the proposed SSDR framework of Section 3.4 (whose instance chosen here, referred to as SSMOPE, combines the UDR method OLPP and an SDR instance derived from the MOPE framework of Section 3.3) outperforms the existing ML-LGC. As can be seen from the 40 percent rows of the table, when there is not enough labeled data available, the embeddings computed from both labeled and unlabeled data can possess better discriminating ability than those computed from only labeled data, as in MOPE, or by ignoring label information altogether, as in OLPP here.
T. Mu and S. Ananiadou are with the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, MIB Building, 131 Princess Street, Manchester M1 7DN, United Kingdom.
J.Y. Goulermas is with the Department of Electrical Engineering and Electronics, University of Liverpool, Brownlow Hill, Liverpool L69 3GJ, United Kingdom.
J. Tsujii is with Microsoft Research Asia, Beijing, China.
Manuscript received 18 Oct. 2010; revised 6 Aug. 2011; accepted 25 Dec. 2011; published online 9 Jan. 2012.
Recommended for acceptance by K. Murphy.
For information on obtaining reprints of this article, please send e-mail to: firstname.lastname@example.org, and reference IEEECS Log Number TPAMI-2010-10-0799.
Digital Object Identifier no. 10.1109/TPAMI.2012.20.
1. http://archive.ics.uci.edu/ml/support/Reuters-21578+Text+Categorization+Collection.
4. Our code can be downloaded from http://pcwww.liv.ac.uk/~goulerma/software/mesd-mope.zip.
Tingting Mu received the BEng degree in electronic engineering and information science from the University of Science and Technology of China, Hefei, China, in 2004, and the PhD degree in electrical engineering and electronics from the University of Liverpool, Liverpool, United Kingdom, in 2008. She is currently a postdoctoral researcher with the School of Computing, Informatics and Media, University of Bradford, United Kingdom. Her current research interests include machine learning and pattern recognition, with applications to text mining, bioinformatics, biomedical engineering, and intelligent transportation systems. She is a member of the IEEE.
John Yannis Goulermas received the BSc degree (first class) in computation from the University of Manchester Institute of Science and Technology (UMIST), Manchester, United Kingdom, in 1994, and the MSc degree by research and the PhD degree from the Control Systems Center, Department of Electrical and Electronic Engineering, UMIST, in 1996 and 2000, respectively. He is currently a reader in the Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool, United Kingdom. His current research interests include machine learning, mathematical modeling/optimization, and image processing with application areas biomedical engineering and industrial monitoring and control. He is a senior member of the IEEE.
Jun'ichi Tsujii is a principal researcher with Microsoft Research Asia, China. Until March, 2011 he was a professor of text mining in the School of Computer Science, University of Manchester, and a professor of computer science at the University of Tokyo. He has worked since 1973 in NLP, question answering, text mining, and machine translation. His recent research achievements include: deep semantic parsing based on feature forest model, efficient search algorithms for statistical parsing, improvement of estimator for maximum entropy model, and construction of the gold standard corpus (GENIA) for biomedical text mining. He has authored more than 300 papers in journals, conferences, and books. He was the president of the Association for Computational Linguistics (ACL) in 2006 and has been a permanent member of the International Committee on Computational Linguistics (ICCL) since 1992. He received the IBM Science Award (1988), the IBM Faculty Award (2005), and the Medal of Honor with Purple Ribbon from the Government of Japan in 2010.
Sophia Ananiadou is a director of the National Centre for Text Mining (NaCTeM) and a professor in Text Mining in the School of Computer Science, University of Manchester. She is the main designer of the text-mining tools and services currently used in NaCTeM, i.e., terminology management, information extraction, intelligent searching, and association mining. Her research projects include text mining-based visualization of biochemical networks, data integration using text mining, building biolexica and bio-ontologies for gene regulation, automatic event extraction of bioprocesses, as well as EC FP7 project MetaNet4U and research industrially funded by Pfizer/AstraZeneca/IBM/BBC. She has been awarded the Daiwa Adrian prize (2004) and the IBM UIMA innovation award (2006, 2007, 2008) for her leading work on text-mining tools in biomedicine. She has more than 160 publications in journals, conferences, and books.