Proximity-Based Frameworks for Generating Embeddings from Multi-Output Data
NOVEMBER 2012 (Vol. 34, No. 11) pp. 2216-2232

Published by the IEEE Computer Society
Tingting Mu, Member, IEEE
John Yannis Goulermas, Senior Member, IEEE
Jun'ichi Tsujii
Sophia Ananiadou

Abstract—This paper addresses supervised and semi-supervised dimensionality reduction (DR) by generating spectral embeddings from multi-output data based on pairwise proximity information. Two flexible and generic frameworks are proposed to achieve supervised DR (SDR) for multilabel classification. One, referred to as MESD, is able to extend any existing single-label SDR to multilabel via sample duplication. The other, referred to as MOPE, is a multilabel design framework that tackles the SDR problem by computing weight (proximity) matrices based on simultaneous feature and label information, as a generalization of many current techniques. A diverse set of schemes for label-based proximity calculation, as well as a mechanism for combining label-based and feature-based weight information by considering information importance and prioritization, are proposed for MOPE. Additionally, we summarize many current spectral methods for unsupervised DR (UDR), single/multilabel SDR, and semi-supervised DR (SSDR) and express them under a common template representation as a general guide to researchers in the field. We also propose a general framework for achieving SSDR by combining existing SDR and UDR models, as well as a procedure for reducing the computational cost via learning with a target set of relation features. The effectiveness of our proposed methodologies is demonstrated through experiments with document collections for multilabel text categorization from the natural language processing domain.

1. Introduction
Dimensionality reduction constitutes a significant problem in machine learning, data compression, computer vision, information retrieval, and natural language processing (NLP), as well as many areas of science and engineering where high-dimensional measurements are frequently available. For instance, in the NLP field, rich feature representations are increasingly preferred because they have been shown to improve the performance of various learning tasks [ 1 ], [ 2 ]. However, for many text mining applications, rich features give rise to extremely high-dimensional feature spaces that may render the learning algorithms intractable. Therefore, the removal of redundant information from large-scale features becomes a necessary preprocessing step in many NLP and text mining tasks. Relevant reduction techniques commonly used in NLP include feature selection using wrapper or filter models [ 3 ], [ 4 ], feature clustering [ 4 ], [ 5 ], or latent variable models [ 6 ], [ 7 ].
Recently, a more sophisticated methodology for dimensionality reduction has been developed through manifold learning and spectral analysis [ 8 ], [ 9 ], [ 10 ], [ 11 ], [ 12 ], [ 13 ], [ 14 ], [ 15 ], [ 16 ], [ 17 ], [ 18 ], [ 19 ]. Algorithms of this type generate lower dimensional embeddings which preserve certain properties and characteristics of the original high-dimensional feature space, such as pairwise proximity information based on local neighborhood graphs. Typically, they operate in an unsupervised manner as they do not utilize any label or output information. Although such unsupervised dimensionality reduction (UDR) provides a compact representation of the data when it is used as a preprocessing step serving a supervised classification or regression task, it may not always improve the final performance of that task. This is because real-world data sets are often noisy and may possess incompatibilities between the input features and the output labels (e.g., samples closely/distantly located in the feature space are from different/same classes). Additionally, because such methods are based on local similarity statistics, these problems are often exacerbated.
When the data set is accompanied by complete/partial label information, it becomes natural to pursue supervised/semi-supervised dimensionality reduction (SDR/SSDR). Current SDR methods are either based on the between and within-class information and are closely related to Fisher discriminant analysis [ 19 ], [ 20 ], [ 21 ], [ 22 ], [ 23 ], [ 24 ], [ 25 ], or modify various UDR methods by incorporating label information to the proximities governing the embeddings [ 26 ], [ 27 ], [ 28 ], [ 29 ], [ 30 ]. For SSDR, many existing methods combine together the proximity information from the SDR and UDR algorithms [ 29 ], [ 31 ], [ 32 ]. A broader review on semi-supervised methods can be found in [ 33 ].
Nevertheless, the structure of the available label information may restrict the usefulness of these SDR and SSDR techniques, as they are only applicable to single-label classification, where each sample belongs to exactly one of two or more classes. For complex classification tasks reflecting the needs of contemporary applications, embedding generation may have to take into account the existence of multiple labels. In multilabel classification, each sample is allowed to simultaneously belong to more than one class, where different samples share subsets of the class labels. Since such a classification setup is important in NLP, text mining, and bioinformatics [ 34 ], [ 35 ], multilabel SDR/SSDR have attracted increasing interest, with various algorithms proposed. These are based on combining the feature and label information [ 36 ], [ 37 ], maximizing label correlations [ 38 ], [ 39 ], using hypergraph modeling [ 40 ], or optimizing statistical criteria between embeddings and label vectors [ 41 ], [ 42 ], [ 43 ], [ 44 ], [ 45 ], [ 46 ], [ 47 ].
The above methods are spectral approaches. Generative ones, based on Gaussian process latent variable models [ 48 ], [ 49 ] and maximum entropy unfolding [ 50 ], which explicitly model the data distributions and learn the covariance matrix, have also been proposed. The advantages of generative methods are that they can recover the manifold and generalize from fewer and noisier training samples. They perform the mapping of the latent space to the high-dimensional data space using smoothness constraints to preserve global or local proximities. On the other hand, spectral methods support more explicit and flexible modeling of the mapping of the data to the embedding space. They are more direct to optimize and incorporate data constraints. A recent analysis of these approaches can be found in [ 51 ].
In this work, we focus on spectral SDR and SSDR for multilabel classification. The contributions of this paper include the proposal of different novel embedding methods, and also a comparative survey of many current spectral techniques for UDR and single/multilabel SDR and SSDR. We summarize and relate many existing techniques, and reexpress them under a common template representation as a general guide to practitioners in the field. We propose a simple framework, referred to as multilabel extension via sample duplication (MESD), for extending all existing single-label SDR to multilabel via embedding generation based on sample duplication. This provides a practical and easy mechanism to apply for efficiently solving multilabel problems. More importantly, we propose a multilabel design framework, referred to as multioutput proximity-based embeddings (MOPE), that tackles the SDR problem by computing weight matrices based on simultaneous feature and label information, as a generalization of many existing techniques. MOPE offers a flexible methodology for constructing the label-based proximity structures between samples, using a diverse set of schemes that can be readily applied to any existing data sets. To improve the robustness of combining label-based and feature-based information, we also develop an adaptive mechanism which relies on information prioritization. Finally, we introduce a general platform for achieving SSDR by combining existing SDR and UDR methods, provide the kernel extension of the proposed methods, and also show how embeddings can be obtained with reduced computational cost by learning a set of relation features.
The organization of this paper is as follows: Section 2 gives a succinct but thorough review of the state-of-the-art UDR and single/multilabel SDR and SSDR techniques. The subsections of Section 3 analyze the different contributions stated in the previous paragraph. Section 4 reports the experimental results and comparative analyses, while Section 5 concludes the work.
2. Previous Spectral Dimensionality Reduction Techniques
Given a set of data points (samples) $\{{{\schmi{\schmi x}}_{i}}\}_{i=1}^{n}$ of dimension $d$ , where ${\schmi{\schmi x}}_{i}=[x_{i1},\;x_{i2},\;\ldots,\;x_{id}]^T$ , the goal of dimensionality reduction is to generate a set of optimal embeddings $\{{\schmi{\schmi z}}_{i}\}_{i=1}^{n}$ of dimension $k$ ( $k \ll d$ ), where ${\schmi{\schmi z}}_i=[z_{i1},\;z_{i2},\;\ldots,\;z_{ik}]^T$ so that the transformed $n\times k$ feature matrix ${\bf Z}=[z_{ij}]$ is an accurate representation of the original $n\times d$ feature matrix ${\bf X}=[x_{ij}]$ while (when applicable) it supports improved discrimination between classes.
2.1 Unsupervised Dimensionality Reduction
Principal component analysis (PCA) is the most popular UDR method, which maps feature vectors into a smaller number of uncorrelated directions. The method extracts a $d\times k$ orthogonal projection matrix $\bf P$ so that the variance of the projected vectors is maximized [ 52 ]:


$$\mathop{\rm max}_{{{\scriptstyle {{{\bf P}\in R^{d\times k},\atop {\bf P}^T{\bf P}={\bf I}_{k\times k}}}}}}{1\over n-1} \sum_{i=1}^{n}\left\Vert {\bf P}^{T}{\schmi{\schmi x}}_i-{1\over n} \sum_{j=1}^n{\bf P}^{T}{\schmi{\schmi x}}_j\right\Vert_2^2.$$


(1)



Latent semantic indexing (LSI) is similar to PCA without centering and is frequently used in information/image retrieval [ 6 ]. The method defines a $d\times k$ orthogonal projection matrix $\bf P$ to enable optimal reconstruction in terms of an error defined via the Frobenius norm:


$$\mathop{\rm min}_{{{\scriptstyle {{{\bf P}\in R^{d\times k},\atop {\bf P}^T{\bf P}={\bf I}_{k\times k}}}}}} \Vert {\bf X}-{\bf X}{\bf P}{\bf P}^T\Vert_{{\rm F}}^2.$$


(2)
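To make the connection between (1) and (2) concrete, the following minimal numpy sketch (ours, with hypothetical function names and toy data) computes both projections: LSI is a truncated SVD of the raw data, and PCA is the same operation after column centering.

import numpy as np

def lsi_projection(X, k):
    # LSI of (2): top-k right singular vectors of the uncentered n x d matrix X
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T                                   # d x k projection matrix P

def pca_projection(X, k):
    # PCA of (1): identical to LSI after removing the column means
    Xc = X - X.mean(axis=0, keepdims=True)
    return lsi_projection(Xc, k)

# toy usage: embed 100 samples of dimension 20 into k = 3 dimensions
X = np.random.rand(100, 20)
P = pca_projection(X, 3)
Z = (X - X.mean(axis=0)) @ P                          # n x k PCA embeddings Z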



Other UDR methods based on graph formulations attempt to minimize the penalized pairwise distance error between all the embeddings:


$$\min_{\{{\schmi{\schmi z}}_{i}\in R^{k}\}_{i=1}^{n}}{1\over 2} \sum_{i,j=1}^n w_{ij}\Vert {\schmi{\schmi z}}_{i}-{\schmi{\schmi z}}_{j}\Vert_{2}^2,$$


(3)



where $w_{ij}$ is a weight defining the degree of similarity or closeness between the $i$ th and $j$ th samples. This is the common underlying error measure in many embedding methods, and is expressed in matrix notation using the $n\times n$ weight matrix ${\bf W}=[w_{ij}]$ and a Laplacian $\bf L$ as


$$\mathop{\rm min}_{{\bf Z}\in R^{n\times k}}{\rm tr}[{\bf Z}^T{\bf L}{\bf Z}].$$


(4)



There are different types of Laplacian matrices. The version defined as ${\bf L}={\bf D}({\bf W})-{\bf W}$, where ${\bf D}({\bf W})$ is a diagonal matrix formed by the vector ${\bf W}\times {\bf 1}_{n\times 1}$, makes (3) and (4) equivalent. To obtain $k$ different columns in $\bf Z$, different constraints are enforced. The orthogonality conditions ${\bf Z}^T{\bf Z}={\bf I}_{k\times k}$ in (4) give rise to the unnormalized spectral clustering (USC) [ 14 ], which is solved with an eigendecomposition of $\bf L$. Alternatively, the normalized spectral clustering (NSC) [ 12 ] and Laplacian eigenmaps (LE) [ 16 ] define the constraints ${\bf Z}^T{\bf D}({\bf W}){\bf Z}={\bf I}_{k\times k}$. This is also equivalent to using the Laplacian ${\bf L}={\bf I}-{\bf D}({\bf W})^{-1}{\bf W}$ and is solved as a generalized eigenvalue problem. Another version is the symmetric Laplacian ${\bf L}={\bf I}-{\bf D}({\bf W})^{-{1\over 2} }{\bf W}{\bf D}({\bf W})^{-{1\over 2} }$, which assumes orthogonality on $\bf Z$ [ 13 ], [ 14 ].
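The following sketch (ours) illustrates the two eigenproblems just described, assuming a symmetric nonnegative weight matrix W with no isolated nodes is already available; USC uses a plain eigendecomposition of L, while LE/NSC solve the generalized problem.

import numpy as np
from scipy.linalg import eigh

def usc_embeddings(W, k):
    # USC: minimize tr(Z^T L Z) subject to Z^T Z = I, with L = D(W) - W
    L = np.diag(W.sum(axis=1)) - W
    vals, vecs = np.linalg.eigh(L)                    # eigenvalues in ascending order
    return vecs[:, 1:k + 1]                           # drop the trivial constant eigenvector

def le_embeddings(W, k):
    # LE/NSC: same objective subject to Z^T D(W) Z = I (generalized eigenproblem)
    D = np.diag(W.sum(axis=1))                        # requires all row sums to be positive
    vals, vecs = eigh(D - W, D)
    return vecs[:, 1:k + 1]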
Another UDR technique is multidimensional scaling (MDS), which generates embeddings with distances between them being correspondingly close to those of the original data points. MDS solves the following optimization problem [ 53 ]:


$$\mathop{\rm min}_{{\bf Z}\in R^{n\times k}} \left\Vert -{1\over 2} \left({\bf I}-{1\over n}{\bf 1}_{n\times n}\right){\bf E}\left({\bf I}-{1\over n}{\bf 1}_{n\times n}\right)-{\bf Z}{\bf Z}^T\right\Vert_F^2,$$


(5)



where $\bf E$ is the $n\times n$ matrix containing squared euclidean distances between samples computed from the original feature matrix $\bf X$ . By replacing the euclidean distances with the geodesic distances, MDS becomes Isomap [ 15 ].
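A minimal sketch of (5) (ours): double-center the squared-distance matrix and factor the resulting Gram matrix with its top-k eigenpairs; feeding geodesic instead of euclidean distances into E would give Isomap.

import numpy as np

def classical_mds(E, k):
    # E holds squared pairwise euclidean distances between the n samples
    n = E.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n               # centering matrix
    B = -0.5 * J @ E @ J                              # Gram matrix appearing in (5)
    vals, vecs = np.linalg.eigh(B)                    # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:k]                  # top-k eigenpairs
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))   # n x k embeddings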
Locally linear embedding (LLE) is another popular method which first calculates a weight matrix $\bf W$ based on local linear reconstructions of each point from its $K$ -nearest neighbors ( $K$ -NNs) with the constraint ${\bf W}\times {\bf 1}_{n\times 1}={\bf 1}_{n\times 1}$ [ 10 ]. Subsequently, all embeddings are recovered to comply with the reconstruction weights by solving the optimization problem:


$$\mathop{\rm min}_{{{\scriptstyle {{{\bf Z}\in R^{n\times k},\atop {\bf Z}^T{\bf Z}={\bf I}_{k\times k}}}}}}{\rm tr}\left[{\bf Z}^{T}({\bf I}_{n\times n}-{\bf W}^T)({\bf I}_{n\times n}-{\bf W}){\bf Z}\right].$$


(6)



Another class of techniques is based on reexpressing previous methods using the additional linear constraint ${\bf Z}={\bf X}{\bf P}$, which forces the embeddings to be linear combinations of the original features. For instance, locality preserving projections (LPP) [ 17 ] and orthogonal LPP (OLPP) [ 11 ] are such linear variations of LE, while neighborhood preserving projections (NPP) and orthogonal NPP (ONPP) [ 11 ] are linear variations of LLE. OLPP and ONPP directly impose orthogonality ${\bf P}^T{\bf P}={\bf I}_{k\times k}$ on the projection matrix, while NPP requires ${\bf P}^T{\bf X}^T{\bf X}{\bf P}={\bf I}_{k\times k}$ and LPP either the last constraint or ${\bf P}^T{\bf X}^T{\bf D}({\bf W}){\bf X}{\bf P}={\bf I}_{k\times k}$.


2.1.1 Feature-Based Weight Matrix
In general, the weight matrix $\bf W$ used by the above algorithms is controlled by an adjacency graph. Each weight $w_{ij}$ is nonzero only for adjacent nodes in the graph. There are two principal ways to define this adjacency:

    • when two samples are the (mutual or undirected) $K$ -NNs of each other,

    • when a certain "closeness" measure between two samples is satisfied.

There are various alternatives to defining $w_{ij}$ :

    • Constant values $w_{ij}=1$ if the $i$ th and $j$ th samples are adjacent, and $w_{ij}=0$ otherwise.

    • The Gaussian kernel, giving



    $$w_{ij}=\exp \left({-\Vert {\schmi{\schmi x}}_i-{\schmi{\schmi x}}_j\Vert_{2}^2\over \tau } \right),$$


    (7)



    and $\tau >0$ .

    • Local-scaling Gaussian kernel [ 54 ], with



    $$w_{ij}=\exp \left({-\Vert {\schmi{\schmi x}}_i-{\schmi{\schmi x}}_j\Vert_{2}^2\over \Vert {\schmi{\schmi x}}_i-{\schmi{\schmi x}}_i^{K}\Vert_{2}\Vert {\schmi{\schmi x}}_j-{\schmi{\schmi x}}_j^{K}\Vert_{2}} \right),$$


    (8)



    and ${\schmi{\schmi x}}_i^{K}$ denotes the $K$ th nearest neighbor of ${\schmi{\schmi x}}_{i}$ .

    • An alternative weight definition [ 27 ], based on



    $$w_{ij}={1\over \tau +{\Vert {\schmi{\schmi x}}_i-{\schmi{\schmi x}}_j\Vert_{2}^2\over \Vert {\schmi{\schmi x}}_i\Vert_{2}^2+\Vert {\schmi{\schmi x}}_j\Vert_{2}^2} } .$$


    (9)



    • The optimal weight reconstruction matrix of LLE [ 10 ], as mentioned earlier.

    • Any user-defined domain-dependent set of similarities between the samples [ 55 ].

    • Any of the classical cosine norm, Pearson's, Spearman's or Kendall's correlation coefficients, etc.

    • Weights based on geodesic distances, as in Isomap [ 15 ] and related updating schemes [ 56 ].
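As a rough sketch of the Gaussian (7) and local-scaling (8) alternatives above (ours; brute-force distances and a hypothetical function name), a K-NN graph can be weighted as follows:

import numpy as np

def knn_weights(X, K=5, tau=1.0, local_scaling=False):
    # squared euclidean distances between all pairs of samples (rows of X)
    n = X.shape[0]
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    order = np.argsort(D2, axis=1)                    # column 0 is the point itself
    if local_scaling:
        sigma = np.sqrt(D2[np.arange(n), order[:, K]])    # distance to the K-th neighbor
        S = np.exp(-D2 / np.outer(sigma, sigma))          # eq. (8)
    else:
        S = np.exp(-D2 / tau)                             # eq. (7)
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), K)
    cols = order[:, 1:K + 1].ravel()                  # indices of the K nearest neighbors
    W[rows, cols] = S[rows, cols]                     # keep only K-NN edges
    return np.maximum(W, W.T)                         # symmetrize (undirected adjacency)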

2.2 Supervised Dimensionality Reduction
In this type of technique, the datasets include label information. The objective is to generate embeddings that, in addition to compacting information, facilitate the underlying discrimination task.


2.2.1 Single-Label Classification
Fisher discriminant analysis (FDA) [ 20 ] is the most commonly used linear SDR method for single-label classification, employing the projection ${\bf Z}={\bf X}{\bf P}$ . It computes the optimal projection matrix as

$$\mathop{\rm max}_{{\bf P}\in R^{d\times k} }{{\rm tr}[{\bf P}^T{\bf S}_b{\bf P}]\over {\rm tr}[{\bf P}^T{\bf S}_w{\bf P}]},$$


(10)



where ${\bf S}_b$ and ${\bf S}_w$ are the between-class and within-class scatter matrices. FDA motivates a group of other related SDR techniques that pursue embeddings or projections by employing different ways of setting up the between- and within-class matrices. For instance, by incorporating the local structure of the data into the Fisher criterion, local FDA (LFDA) combines the ideas of LPP and FDA [ 24 ]. LFDA solves the same optimization problem as in (10), but with ${\bf S}_b$ and ${\bf S}_w$ redefined as in [ 24 ] using local information. Marginal Fisher analysis (MFA) is another variant of FDA that also considers the local data structure [ 23 ], and solves (10), with ${\bf S}_w$ and ${\bf S}_b$ based on intraclass $K_1$ -NNs and interclass $K_2$ -nearest pairs ( $K_{2}$ -NPs).
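For concreteness, a minimal sketch (ours) of the standard FDA scatter matrices and the usual generalized-eigenproblem relaxation of the trace-ratio criterion (10); the small ridge keeps ${\bf S}_w$ invertible and is an assumption of this example.

import numpy as np
from scipy.linalg import eigh

def fda_projection(X, y, k, reg=1e-6):
    # X: n x d features, y: length-n vector of single-label class indices
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sb = np.zeros((d, d))                             # between-class scatter
    Sw = np.zeros((d, d))                             # within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
        Sw += (Xc - mc).T @ (Xc - mc)
    vals, vecs = eigh(Sb, Sw + reg * np.eye(d))       # S_b p = lambda S_w p
    return vecs[:, np.argsort(vals)[::-1][:k]]        # d x k projection matrix P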

With the same goal as FDA, the maximum margin criterion (MMC) uses the form $S_b-S_w$ instead of the ratio ${S_b\over S_w}$ used by the Fisher criterion. The MMC-based supervised embeddings can be obtained by solving the following optimization problem [ 21 ]:



$$\max_{{{\scriptstyle {{{\bf P}\in R^{d\times k},\atop {\bf P}^T{\bf P}={\bf I}_{k\times k}}}}}}{\rm tr}[{\bf P}^T({\bf S}_b-{\bf S}_w){\bf P}].$$


(11)



A regularized version of MMC, in the form of $S_b-\lambda S_w$ with $\lambda >0$, is sometimes preferred for implementation [ 31 ]. Discriminative locality alignment (DLA) is a recently proposed SDR method, defined by Zhang et al. [ 19 ]:



$${\rm min}\sum_{i=1}^n\sum_{j=1}^{K_1}\left\Vert {\schmi{\schmi x}}_i-\bar{{\schmi{\schmi x}}}_i^{j}\right\Vert^2_2-\lambda \sum_{i=1}^n\sum_{j=1}^{K_2}\left\Vert {\schmi{\schmi x}}_i-\hat{{\schmi{\schmi x}}}_i^{j}\right\Vert^2_2,$$


(12)



where the data points $\{\bar{{\schmi{\schmi x}}}_i^{j}\}_{j=1}^{K_1}$ denote the intraclass $K_1$ -NNs of ${\schmi{\schmi x}}_i$ , and $\{\hat{{\schmi{\schmi x}}}_i^{j}\}_{j=1}^{K_2}$ denote the interclass $K_2$ -NNs of ${\schmi{\schmi x}}_i$ . By viewing the first term of (12) as a modified within-class scatter constructed from intraclass neighbors and the second term as a modified between-class scatter constructed from interclass neighbors, DLA minimizes a criterion in the form of $S_w-\lambda S_b$ . This can be viewed as a modified MMC using the local structure of data.

Another group of SDR techniques has been developed by incorporating the label information into OLPP. Discriminant neighborhood embedding (DNE) replaces the weight matrix of OLPP with the following [ 26 ]:



$$w_{ij}=\left\{ \matrix{+1 & {\rm if} \; {\schmi x_i}\; {\rm and} \; {\schmi x_j} \; {{\rm are \;intraclass} \;K\hbox{-}{\rm NNs}},\cr -1 & {\rm if} \; {\schmi x_i}\; {\rm and} \; {\schmi x_j} \; {{\rm are \;interclass} \;K\hbox{-}{\rm NNs}},\cr 0 & {\rm otherwise}. \hfill} \right.$$


(13)



This minimizes the distances between intraclass neighbors while maximizing the distances between interclass neighbors, as DLA does. Repulsion OLPP (OLPP-R) is an SDR method which replaces the Laplacian in OLPP with a linear combination of a repulsion Laplacian ${\bf L}_r={\bf D}({\bf W}_r)-{\bf W}_r$ and a class Laplacian ${\bf L}_s={\bf D}({\bf W}_s)-{\bf W}_s$ [ 27 ]. The two weight matrices ${\bf W}_s$ and ${\bf W}_r$ are set to have those nonzero elements corresponding to pairs of points from the same class and nearby pairs from different classes, respectively. The final Laplacian of OLPP-R is ${\bf L}={\bf L}_s-\beta {\bf L}_r$, where $\beta >0$ is a user-defined parameter.

The unsupervised ONPP method has also been adapted for SDR. For example, the supervised ONPP (SONPP) [ 11 ] uses LLE-style reconstruction weights calculated for all samples from the same class. Discriminative ONPP (DONPP) [ 29 ] uses slightly different reconstruction weights from SONPP, calculated by assuming each sample is reconstructed from the remaining samples of its class; it minimizes the corresponding embedding distances while, at the same time, maximizing the distances between embeddings of nearby samples from different classes. The repulsion ONPP (ONPP-R) [ 27 ] uses a similar weight matrix to SONPP, but incorporates a repulsion Laplacian as OLPP-R does.



2.2.2 Multilabel Classification
Given a multilabel classification data set, most SDR algorithms model the class (target) information as an $n\times c$ output matrix ${\bf Y}=[{\schmi{\schmi y}}_{1},\;{\schmi{\schmi y}}_{2},\;\ldots,\;{\schmi{\schmi y}}_{n}]^{T}$ [ 37 ], [ 42 ], [ 43 ], [ 44 ], where each element $y_{ij}=1$ if the $i$ th sample belongs to the $j$ th class ${\cal C}_j$ , and $y_{ij}=0$ (or $y_{ij}=-1$ sometimes) otherwise.

Classical SDR algorithms for multilabel classification compute embeddings by optimizing a statistical criterion between the embeddings and multiple labels. Canonical correlation analysis (CCA), for example, maximizes the correlation coefficient between the projected features and the labels [ 41 ], [ 43 ] as



$$\max_{{{{{{\bf P}\in R^{d\times k},\atop {\bf P}^T{\bf X}_{c}^{T}{\bf X}_{c}{\bf P}={\bf I}_{k\times k}}}}}}{\rm tr}\big[{\bf P}^T{\bf X}_{c}^T{\bf Y}_{c}\big({\bf Y}^{T}_{c}{\bf Y}_{c}\big)^{-1}{\bf Y}_{c}^{T}{\bf X}_{c}{\bf P}\big],$$


(14)



where ${\bf X}_{c}=({\bf I}_{n\times n}-{1\over n}{\bf 1}_{n\times n}){\bf X}$ and ${\bf Y}_{c}=({\bf I}_{n\times n}-{1\over n}{\bf 1}_{n\times n}){\bf Y}$ are the centered feature and label matrices. The regularized CCA (rCCA) modifies the constraint of the original CCA to ${\bf P}^T({\bf X}_{c}^{T}{\bf X}_{c}+\lambda {\bf I}_{d\times d}){\bf P}={\bf I}_{k\times k}$ to prevent overfitting and avoid the singularity of ${\bf X}_{c}^{T}{\bf X}_{c}$ , where $\lambda >0$ is the user-defined regularization parameter [ 43 ], [ 57 ]. Partial least squares (PLS) maximizes the covariance between the projected features and the labels [ 42 ], [ 58 ]:



$$\mathop{\rm max}_{{{\scriptstyle {{{\bf P}\in R^{d\times k},\atop {\bf P}^T{\bf P}={\bf I}_{k\times k}}}}}}{\rm tr}\big[{\bf P}^T{\bf X}_{c}^T{\bf Y}_{c}{\bf Y}_{c}^{T}{\bf X}_{c}{\bf P}\big].$$


(15)



The orthonormalized PLS (OPLS) changes the constraint of (15) to ${\bf P}^T{\bf X}_{c}^{T}{\bf X}_{c}{\bf P}={\bf I}_{k\times k}$ [ 42 ], [ 59 ]. Both the multilabel dimensionality reduction via dependence maximization (MDDM) [ 44 ] and supervised feature extraction using Hilbert-Schmidt norm (SFEHS) [ 46 ] compute the optimal embeddings by maximizing the Hilbert-Schmidt independence criterion (HSIC) between the projected features and the labels. Using the HSIC defined in [ 60 ], MDDM and SFEHS solve the following optimization problem:



$$\max_{{{{{{\bf P}\in R^{d\times k},\atop {\bf P}^T{\bf P}={\bf I}_{k\times k}}}}}}{\rm tr}\big[{\bf P}^{T}{\bf X}_{c}^{T}{\bf K}_{Y}{\bf X}_{c}{\bf P}\big],$$


(16)



where ${\bf K}_{Y}$ can be either ${\bf Y}{\bf Y}^T$ or the $n\times n$ kernel matrix computed from the label vectors $\{{\schmi{\schmi y}}_{i}\}_{i=1}^{n}$ . Using the HSIC defined in [ 61 ], SFEHS solves a similar optimization problem to (16), but with ${\bf X}_{c}^{T}{\bf K}_{Y}{\bf X}_{c}$ replaced by a different $d\times d$ symmetric matrix (see [ 46 ]).
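The shared structure of (14)-(16) can be made explicit with a small sketch (ours, assuming ${\bf Y}_c^{T}{\bf Y}_c$ is invertible and using the linear kernel ${\bf K}_Y={\bf Y}{\bf Y}^T$ for MDDM): each method pairs a different objective matrix with a different constraint matrix in the same generalized eigenproblem.

import numpy as np
from scipy.linalg import eigh

def center(M):
    return M - M.mean(axis=0, keepdims=True)

def multilabel_projection(X, Y, k, method="rcca", lam=1e-3):
    Xc, Yc = center(X), center(Y)
    d = X.shape[1]
    if method == "rcca":                              # eq. (14), regularized constraint of rCCA
        A = Xc.T @ Yc @ np.linalg.solve(Yc.T @ Yc, Yc.T @ Xc)
        B = Xc.T @ Xc + lam * np.eye(d)
    elif method == "pls":                             # eq. (15): covariance objective, P^T P = I
        A = Xc.T @ Yc @ Yc.T @ Xc
        B = np.eye(d)
    elif method == "mddm":                            # eq. (16) with K_Y = Y Y^T
        A = Xc.T @ (Y @ Y.T) @ Xc
        B = np.eye(d)
    else:
        raise ValueError(method)
    vals, vecs = eigh(0.5 * (A + A.T), B)             # generalized eigenproblem
    return vecs[:, np.argsort(vals)[::-1][:k]]        # top-k projection directions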

The hypergraph spectral learning (HSL) method [ 40 ] achieves multilabel SDR by modifying LPP: it replaces the Laplacian matrix of LPP with a hypergraph-based Laplacian matrix ${\bf I}_{n\times n}-{\bf S}$, where $\bf S$ is an $n\times n$ similarity matrix computed from a hypergraph modeling the multilabel class information of the training data set. Three different ways are provided in [ 40 ] to compute this similarity matrix, including clique expansion, star expansion, and the hypergraph Laplacian [ 62 ], corresponding to HSL1, HSL2, and HSL3, respectively.

The supervised latent variable model (SLVM) is another SDR algorithm for multilabel classification, aiming at the optimal reconstruction of both the feature and label matrices in the embedded space. It is based on the optimization [ 36 ], [ 37 ]



$$\mathop{\rm max}_{{{\scriptstyle {{{\bf Z}\in R^{ n \times k},\atop{\bf Z}^T{\bf Z}={\bf I}_{k\times k}}}}}}{\rm tr}\big[{\bf Z}^T\left(\beta {\bf Y}{\bf Y}^{T}+(1-\beta ){\bf X}{\bf X}^{T}\right){\bf Z}\big],$$


(17)



where $0\le \beta \le 1$ is a degree parameter determining how much the embeddings should be biased by the label information. By centering both feature and label matrices, applying the projection technique, and introducing the Tikhonov regularization, the multi-output regularized feature projection (MORP) is also proposed in [ 37 ]:



$$\min_{{{{{{\bf P}\in R^{ d \times k},\atop {\bf P}^T{\bf X}_{c}^{T}{\bf X}_{c}{\bf P}={\bf I}_{k\times k}}}}}}{\rm tr}\left[{\bf P}^T\left({\bf X}_{c}^T {\bf H}{\bf X}_{c}+\lambda {\bf I}_{d\times d}\right){\bf P}\right],$$


(18)



where ${\bf H}=(\beta {\bf Y}_c{\bf Y}_c^{T}+(1-\beta ){\bf X}_c{\bf X}_c^{T})^{-1}$ and $\lambda>0$ controls the regularization.
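A minimal sketch of the SLVM objective (17) (ours): the embeddings are simply the top-k eigenvectors of the convex combination of the label and feature Gram matrices.

import numpy as np

def slvm_embeddings(X, Y, k, beta=0.5):
    # maximize tr(Z^T (beta*YY^T + (1-beta)*XX^T) Z) subject to Z^T Z = I
    M = beta * (Y @ Y.T) + (1.0 - beta) * (X @ X.T)   # n x n objective matrix of (17)
    vals, vecs = np.linalg.eigh(M)                    # ascending eigenvalues
    return vecs[:, np.argsort(vals)[::-1][:k]]        # orthonormal n x k embedding matrix Z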

2.3 Semi-Supervised Dimensionality Reduction
Sometimes, among the $n$ training samples $\{{\schmi{\schmi x}}_{i}\}_{i=1}^{n}$ , only a portion of the samples is labeled. The goal of SSDR algorithms is to generate embeddings from both labeled and unlabeled data. A direct way to achieve SSDR is by combining an SDR algorithm with a UDR one. For instance, the semi-supervised LFDA (SELF) combines LFDA and PCA [ 25 ]. It solves the optimization in (10), but with ${\bf S}_{b}$ and ${\bf S}_{w}$ defined as


$${\bf S}_{b}= (1-\beta ){\bf S}_{b}^{(l)}+{\beta \over 2n} \sum_{i,j=1}^{n}({\schmi{\schmi x}}_{i}-{\schmi{\schmi x}}_{j})({\schmi{\schmi x}}_{i}-{\schmi{\schmi x}}_{j})^{T},$$


(19)





$${\bf S}_{w}= (1-\beta ){\bf S}_{w}^{(l)}+\beta {\bf I}_{d\times d},$$


(20)



where ${\bf S}_{b}^{(l)}$ and ${\bf S}_{w}^{(l)}$ are the scatter matrices computed from the labeled samples as in LFDA, and $0<\beta < 1$ is a tradeoff parameter. SELF can also result from the combination of LFDA and LPP [ 25 ]. Song et al. [ 31 ] propose to combine FDA and MMC, with OLPP to achieve semi-supervised FDA (SSFDA) and semi-supervised MMC (SSMMC), respectively. SSFDA solves the optimization in (10), but modifies the within-class scatter as


$${\bf S}_{w} = {\bf S}_{w}^{(l)}+\beta_{1}{\bf X}^{T}{\bf L}{\bf X}+\beta_{2}{\bf I}_{d\times d},$$


(21)



where ${\bf S}_{w}^{(l)}$ denotes the within-class scatter matrix of FDA computed from the labeled samples, $\bf L$ is the Laplacian used by OLPP computed from both labeled and unlabeled samples, and $\beta_{1},\;\beta_{2}>0$ are user-defined balancing parameters. The SSMMC solves (11) with the same ${\bf S}_{b}$ used by MMC computed from the labeled samples, with the within-class scatter modification


$${\bf S}_{w} = \beta_{1}{\bf S}_{w}^{(l)}+\beta_{2}{\bf X}^{T}{\bf L}{\bf X},$$


(22)



where ${\bf S}_{w}^{(l)}$ is the within-class scatter of MMC computed from the labeled samples. A similar idea for SSDR has also been used by Chatpatanasiri and Kijsirikul [ 32 ] to combine MFA, DNE, and LFDA with OLPP, and in [ 29 ] to combine DONPP with ONPP.
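To illustrate the combination in (21), a short sketch (ours), in which the labeled-data scatter matrices and the OLPP Laplacian over all samples are assumed to have been computed already and are passed in as arguments:

import numpy as np
from scipy.linalg import eigh

def ssfda_projection(X, Sb_l, Sw_l, L, k, beta1=1.0, beta2=1e-3):
    # X: all samples (labeled and unlabeled); Sb_l, Sw_l: FDA scatters of the labeled part;
    # L: graph Laplacian over all samples, as used by OLPP
    d = X.shape[1]
    Sw = Sw_l + beta1 * (X.T @ L @ X) + beta2 * np.eye(d)   # semi-supervised scatter of (21)
    vals, vecs = eigh(Sb_l, Sw)                             # solve (10) with the modified S_w
    return vecs[:, np.argsort(vals)[::-1][:k]]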
A different group of techniques aims at learning $c$ -dimensional embeddings that can be used as predicted labels, e.g., the two models for semi-supervised multilabel learning using the Sylvester equation (SMSE) [ 38 ], the multilabel Gaussian random field (ML-GRF) [ 39 ], and multilabel local and global consistency (ML-LGC) [ 39 ]. Letting ${\bf X}_l$ , ${\bf Y}_l$ , and ${\bf Z}_l$ denote the feature, label, and embedding matrices of the labeled samples and ${\bf Z}_u$ the embedding matrix for the unlabeled samples, this group of methods attempts to achieve the following: 1) best predict the label information ${\bf Y}_l$ from the embeddings ${\bf Z}_{l}$ , 2) preserve the feature-based adjacency between all samples, and 3) drive ${\bf Z}^T$ to preserve a certain similarity between classes. SMSE1 solves the following optimization problem:


$$\min_{{{{{{\bf Z}\in R^{n\times c},\atop {\bf Z}_{l}={\bf Y}_{l} }}}}} \lambda_{1}{\rm tr}[{\bf Z}^{T}{\bf L}_W{\bf Z}]+\lambda_{2}{\rm tr}[{\bf Z}{\bf L}_{\Sigma }{\bf Z}^{T}],$$


(23)



while SMSE2 optimizes


$$\eqalign{ & \min_{{\bf Z}\in R^{n\times c} }{\rm tr}[({\bf Z}_{l}-{\bf Y}_{l})^{T}({\bf Z}_{l}-{\bf Y}_{l})]+{\rm tr}\big[{\bf Z}_{u}^T{\bf Z}_{u}\big]\cr &\quad +\lambda_{1}{\rm tr}[{\bf Z}^{T}{\bf L}_W{\bf Z}]+\lambda_{2}{\rm tr}[{\bf Z}{\bf L}_{\Sigma }{\bf Z}^{T}],}$$


(24)



where $\bf W$ is the weight matrix calculated as discussed in Section 2.1.1, ${\bf \Sigma}$ is a $c\times c$ similarity matrix between classes, and ${\bf L}_W$ and ${\bf L}_{\Sigma }$ are Laplacian matrices computed from $\bf W$ and ${\bf \Sigma}$ , respectively. Alternatively, ML-GRF optimizes


$$\eqalign{& \mathop{\rm min}_{{\bf Z}\in R^{n\times c} }{\rm tr}[({\bf Z}_{l}-{\bf Y}_{l})^{T}({\bf Z}_{l}-{\bf Y}_{l})]\cr &\quad+\lambda_{1}{\rm tr}[{\bf Z}^{T}{\bf L}_W{\bf Z}]-\lambda_{2}{\rm tr}[{\bf Z}{\bf \Sigma}{\bf Z}^{T}],}$$


(25)



while ML-LGC also uses (25) but with the additional term ${\rm tr}[{\bf Z}_{u}^T{\bf Z}_{u}]$ . Both SMSE1 and ML-GRF employ the standard Laplacian, while SMSE2 and ML-LGC the normalized symmetric Laplacian. They all use the cosine similarity or the Gaussian kernel on the columns of ${\bf Y}_{l}$ to compute $\Sigma$ [ 38 ], [ 39 ].
2.4 Unification Frameworks
From the formulations of many of the UDR, SDR, and SSDR instances presented in the previous sections, it is apparent that they possess similar forms of trace optimizations involving various constraints on the embeddings. This similarity has been investigated on a number of occasions, where different approaches have been contrasted and parallels have been drawn. The work in [ 63 ] is one of the earliest that shows connections between PCA, LPP, and LDA regarding the calculation of the weight matrices used to build their Laplacians, and their unsupervised or supervised setup. In [ 64 ], the commonality of LLE, Isomap, LE, PCA, KPCA, and MDS was examined in terms of the underlying eigenfunction estimation these methods perform.
The graph embedding framework proposed in [ 23 ] advocates the general form


$$\mathop{\rm min}_{{{{{{\bf Z}\in R^{n\times k},\atop {\bf Z}^T{\bf B}{\bf Z}={\bf I}_{k\times k}}}}}}{\rm tr}[{\bf Z}^T{\bf A}{\bf Z}],$$


(26)



where $\bf A$ is the intrinsic weight matrix and $\bf B$ the scale or penalty Laplacian (for supervised forms) matrix. For $\bf A$ to comply with the standard interpretation of a Laplacian, the zero row-sum constraint is imposed, i.e., ${\bf D}({\bf A})={\bf 0}_{n\times n}$ . In addition to proposing MFA, the authors compare methods, such as PCA, KPCA, LDA, LPP, Isomap, LLE, and LE and also propose the linear, kernelized and tensorized forms of (26). The work of Kokiopoulou et al. [ 65 ] also compares different methods, such as the unsupervised LLE, LE, PCA, MDS, Isomap, LPP, ONPP, NPP, OLPP, the supervised LDA, and the supervised versions of NPP and LPP in terms of (26) and discusses their kernel formulations.
Another framework is the patch alignment one generalized in [ 19 ] based on the design proposed in [ 66 ]. A patch corresponds to a measurement and its $K$ related neighbors. Letting ${\bf S}_i$ denote an $n\times (K+1)$ binary selection matrix that indicates point membership for the $i$ th patch and $N$ the number of patches, the framework performs global optimization according to


$$\min_{{{{{{\bf Z}\in R^{n\times k},\atop {\bf Z}^T{\bf Z}={\bf I}_{k\times k}}}}}}{\rm tr}\left[{\bf Z}^T\left(\sum_{i=1}^N{\bf S}_i{\bf A}_i{\bf S}_i^T\right){\bf Z}\right],$$


(27)



where ${\bf A}_i$ is the matrix encoding the objective for the $i$ th patch. This template can assume the form of different algorithms, such as PCA, LLE, ONPP, Isomap, LE, LPP, LDA, etc., by adapting the definitions of neighborhood and ${\bf A}_i$ for each patch.
2.5 Discussion
In addition to choosing one of the previously described algorithms depending on whether the task at hand is unsupervised or semi/supervised, with single or multiple labels, we provide below some further analysis of how such algorithms preserve data characteristics in the embedded space, group them into different families, and discuss their corresponding advantages and disadvantages.
The first family is the global feature-based one, which contains UDR algorithms such as PCA, LSI, and MDS that preserve global feature-based structure of the data, such as variance, $k$ -rank feature matrix approximation, and euclidean distances of samples, respectively.
Another family preserves the global structure contained within the label vectors, such as the SDR methods of LDA, MMC, CCA/rCCA, PLS/OPLS, MDDM, SFEHS, and HSL. Specifically, the single-label methods LDA and MMC preserve or sharpen the overall class structure by encouraging the embedded points from different classes to be apart, while those from the same class proximal. Regarding the multilabel methods, different versions of the HSL preserve label proximity structures based on global label graphs that encourage samples associated with high label-based similarities to be close in the embedded space. Although the basic idea of CCA/rCCA, PLS/OPLS, MDDM, and SFEHS is to maximize the compatibility between the embeddings and the label vectors based on a certain statistical criterion, after reexpressing them under the proposed template (see Section 3.1 and Table 1 ), it is shown that these methods also encourage points possessing high label-based similarities to be close, as HSL does, but using different measures for evaluating similarities between multioutput labels.

Table 1. Existing Dimensionality Reduction Algorithms Expressed in Terms of the Templates of (30), (31)



Algorithms flagged by $\bf Z$ optimize embeddings directly, while those flagged by $\bf P$ use projections. $\bf W$ is the feature-based weight matrix discussed in Section 2.1.1 and $\circ$ denotes Hadamard multiplication.

A third family contains methods which simultaneously preserve the global structures of both features and labels. Examples are multilabel methods such as SLVM and MORP that seek a shared $k$ -rank approximation for both the feature and label matrices.
The above families preserve global structure for the features, labels, or both. A different family of algorithms is the local one, and contains the neighborhood-based UDR methods, such as USC, NSC, LE, Isomap, LLE, LPP (OLPP), and NPP (ONPP). Specifically, LLE and NPP (ONPP) attempt to preserve the same neighborhood as in the feature space for each sample, while the remaining methods encourage the neighbor points to be distributed closely in the embedded space.
Finally, we can define a family of methods that preserve local feature and global label information, with members the single-label SDR methods of LFDA, MFA, DLA, DNE, OLPP-R, SONPP, DONPP, and ONPP-R, and the SSDR methods. These attempt to preserve the local feature structure and intrinsic geometry of the data in terms of local neighborhoods while sharpening the global class structure in terms of discriminability. Specifically, in the embedded space, MFA, DLA, and DNE encourage the intraclass neighbors to be close, while the interclass neighbor samples far. LFDA encourages intraclass neighbors to be close, while all samples (not just some neighbors) from different classes far. OLPP-R encourages all the points from the same class to be close, while the interclass neighbors are far. SONPP only preserves the local neighborhood within each class separately. DONPP and ONPP-R not only preserve the local structure within each class but also encourage the interclass neighbor points to be far. Compared with the combination-based SSDR methods, such as SELF, SSFDA, and SSMMC, which also preserve the global label and local feature structures, the SSDR methods of SMSE, ML-GRF, and ML-LGC achieve one additional objective, which is the optimal prediction of the label vectors, and thus they can be used as classifiers on their own.
The advantage of methods from the feature-based global and local families is that they are able to discover a low-dimensional "description" space that captures the perceptually meaningful structure of the original feature space, while the disadvantage is they are not able to improve the classification performance when the input features and output labels possess incompatible structures or features have noise. On the contrary, methods from the global label-based family prioritize the mapping of samples into a discriminant space, where the embedded feature structure is especially modified to match the output structure. However, the disadvantage is that the resulting space could be distorted and may lose track of the intrinsic geometry of the original samples, which may lead to an overfitted space with reduced generalization. Instead, the local feature and global label family balances between the preservation of feature and label-based structures by either using the numbers of interclass and intraclass neighbors or a balancing parameter to control the tradeoff between the two types of structures, which is the trend of current embedding techniques.
3. Proposed Methods
3.1 Generic Template Description
Most UDR and single-label SDR algorithms reviewed in Sections 2.1 and 2.2.1 can be represented by the framework of (26), as presented in the corresponding works of Yan et al. [ 23 ], [ 63 ] and Kokiopoulou et al. [ 65 ]. However, in some cases it is not possible to enforce the zero row-sum constraint for $\bf A$ (or $\bf B$ ) and view it as a Laplacian matrix. For example, the symmetric Laplacian [ 14 ] and many multilabel SDR methods, such as HSL [ 40 ], SLVM [ 37 ], and SFEHS [ 46 ], do not require ${\bf D}({\bf A})={\bf 0}_{n\times n}$ . Regarding the patch alignment framework of [ 19 ], it can represent many models, but it is not as intuitive as the trace optimization, and can allow nonunique representations of the same model.
The need for optimizing with the Laplacian matrix stems from the well-known equivalent forms


$${1\over 2} \sum_{i,j=1}^n w_{ij}\Vert {\schmi{\schmi z}}_{i}-{\schmi{\schmi z}}_{j}\Vert_{2}^2 = {\rm tr}[{\bf Z}^T{\bf L}{\bf Z}] =$$


(28)





$$\eqalign{&{\rm tr}[{\bf Z}^T{\bf D}({\bf W}){\bf Z}] - {\rm tr}[{\bf Z}^T{\bf W}{\bf Z}] \cr &\quad= \sum_{i=1}^n d_{i} \Vert {\schmi{\schmi z}}_{i}\Vert_{2}^2 - \sum_{i,j=1}^n w_{ij}{\schmi{\schmi z}}_{i}^{T}{\schmi{\schmi z}}_{j}, }$$


(29)



where $d_{i}$ is the $i$ th diagonal element of ${\bf D}({\bf W})$ . The last two expressions consist of a weighted sum of pairwise dot products subtracted from a scale measurement term. Thus, with appropriate constraints on $\bf Z$ , minimizing the weighted sum of pairwise distances (or dissimilarities) in (28) is equivalent to maximizing the weighted sum of pairwise products (or similarities) in (29). In order to express all the methods discussed in Section 2 under the trace optimization of (26), it is more suitable to choose its algebraically similar but conceptually different form


$$\max_{{{{{{\bf Z}\in R^{n\times k},\atop {\bf Z}^T{\bf B}{\bf Z}={\bf I}_{k\times k}}}}}}{\rm tr}[{\bf Z}^T{\bf A}{\bf Z}],$$


(30)



where $\bf A$ can be viewed as an arbitrary proximity matrix between $n$ data points and $\bf B$ as a scale or label information constraint matrix. When linear projections are involved where embeddings are recovered via ${\bf Z}={\bf X}{\bf P}$ , this template can be rewritten as


$$\max_{{{\scriptstyle {{{\bf P}\in R^{d\times k},\atop {\bf P}^T{\bf B}_{p}{\bf P}={\bf I}_{k\times k}}}}}}{\rm tr}[{\bf P}^T{\bf A}_{p}{\bf P}],$$


(31)



where ${\bf A}_{p}$ and ${\bf B}_{p}$ are the objective and constraint matrices, respectively. When ${\bf A}_{p}= {\bf X}^T{\bf A}{\bf X}$ , (30) and (31) are simply equivalent, but, as we will see later, for some SSDR methods this need not be the case. The flexibility of this version of the trace template can be exemplified even in the case of PCA and LSI. Although PCA can be expressed by (26) via setting ${\bf W}={1\over n}{\bf 1}_{n\times n}$ , which makes $\bf L$ a centering matrix (and using $-{\bf L}$ for maximization), it cannot express LSI which is equivalent to PCA without the data centering. It is much more direct to express both PCA and LSI as instances of (31). The same can be observed for many other methods contained in Table 1 .
As a general guide to practitioners in the field, we use the above generic trace maximizing templates to reexpress and synopsize in Table 1 most of the methods reviewed in Section 2. The table is split into row sections for UDR, single- and multilabel SDR, and SSDR methods and shows the forms of the corresponding model parameters $\bf A$ and $\bf B$ (or ${\bf A}_{p}$ and ${\bf B}_{p}$ ). It can be seen that the proximity matrices $\bf A$ and ${\bf A}_p$ can be computed from features only, labels only, or both features and labels. The constraint matrices $\bf B$ and ${\bf B}_p$ can be identity matrices imposing an orthogonality condition on the embeddings and the projections, respectively. $\bf B$ can also be the diagonal matrix ${\bf D}({\bf W})$ imposing scale, or a dissimilarity matrix based on the within- and between-class structures. In this case, ${\bf B}_p$ can either be ${\bf X}^{T}{\bf B}{\bf X}$ or its regularized version ${\bf X}^{T}{\bf B}{\bf X}+\lambda {\bf I}_{d\times d}$ .
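In computational terms, both templates reduce to a symmetric (generalized) eigenproblem. The sketch below (ours, with a hypothetical function name) solves (30)/(31) for any objective/constraint pair assembled as in Table 1; for instance, ${\bf A}=\beta {\bf Y}{\bf Y}^T+(1-\beta ){\bf X}{\bf X}^T$ with ${\bf B}={\bf I}$ recovers SLVM of (17).

import numpy as np
from scipy.linalg import eigh

def trace_template(A, B=None, k=2):
    # maximize tr(Z^T A Z) subject to Z^T B Z = I, as in (30); use the same routine
    # with A_p, B_p for the projected form (31); B must be symmetric positive definite
    A = 0.5 * (A + A.T)                               # enforce symmetry
    if B is None:                                     # B = None means the identity constraint
        vals, vecs = np.linalg.eigh(A)
    else:
        vals, vecs = eigh(A, 0.5 * (B + B.T))         # generalized eigenproblem
    return vecs[:, np.argsort(vals)[::-1][:k]]        # top-k (generalized) eigenvectors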
3.2 Multilabel Extension via Sample Duplication (MESD)
One principal focus of this work is the generalization of SDR for multilabel classification. The multilabel information is generally modeled by an $n\times c$ binary matrix ${\bf Y}=[y_{ij}]$ , where $y_{ij}=1$ if the $i$ th sample belongs to the $j$ th class. Since many sophisticated SDR algorithms have already been proposed for single-label classification, the most direct way to develop an SDR algorithm for multilabel classification is to extend these existing ones to analogous multilabel versions which generate multioutput dependent embeddings. To achieve this, we propose a simple framework, MESD, that is generic, flexible, and applicable to the existing single-label SDR models. MESD relies on converting the multilabel class information to single-label by duplicating samples which belong to multiple classes and assigning them unique class labels. Similar procedures for transforming a multi- to a single-label dataset for classification tasks have been applied in data mining [ 67 ], [ 68 ], [ 69 ], using various copy, select, and ignore transformations. Here, however, sample duplication is implemented through an efficient mechanism, and is only employed for the calculation of the weights and not the classification stage.
Letting $m_i = \sum_{j=1}^c y_{ij}$ denote the number of classes the $i$ th sample belongs to, each multilabel training sample $({\schmi{\schmi x}}_i,{\schmi{\schmi y}}_i)$ is replaced with $m_i$ new single-label samples $\{({\schmi{\schmi x}}_i^{(j)}, y_i^{(j)})\}_{j=1}^{m_i}$ , where ${\schmi{\schmi x}}_i^{(j)}$ is the $j$ th copy of ${\schmi{\schmi x}}_i$ , and $y_i^{(j)}\in \{1,\;2,\;\ldots,\;c\}$ is the index of the $j$ th nonzero element of the label vector ${\schmi{\schmi y}}_i$ . If we denote by $\tilde{n}=\sum_{i=1}^n\sum_{j=1}^c y_{ij}$ the total number of samples after duplication, a single-label SDR algorithm with an explicit implementation of the duplication procedure would compute an $\tilde{n}\times k$ embedding matrix from the converted $\tilde{n}\times d$ feature matrix. For many single-label SDR algorithms employing the projection constraint ${\bf Z}={\bf X}{\bf P}$ , the same embedding is guaranteed for copies from the same sample since ${\schmi{\schmi z}}_i^{(s)} ={\bf P}^{T}{\schmi{\schmi x}}_i^{(s)}={\bf P}^{T}{\schmi{\schmi x}}_i^{(t)}={\schmi{\schmi z}}_i^{(t)}$ . But for those which do not follow this projection, the duplication would result in different embeddings for copies from the same sample, which is not always reasonable and could complicate the subsequent classification task.
Nevertheless, to guarantee that for MESD this effect does not occur, i.e., ${\schmi{\schmi z}}_{i}^{(s)}={\schmi{\schmi z}}_i$ , for $s=1,\;2,\;\ldots,\;m_i$ , and also achieve efficient implementation of any existing single-label SDR algorithm with the optimization templates of Section 3.1, we do not need to explicitly implement sample duplication. Instead, we derive MESD by adjusting the template matrices. Specifically, the objective function of (30) can be expressed as


$${\rm tr}[{\bf Z}^T{\bf A}{\bf Z}]= \sum_{i,j=1}^n \left(\sum_{s=1}^{m_i}\sum_{t=1}^{m_j}a^{(s,t)}_{ij}\right){\schmi{\schmi z}}_{i}^T{\schmi{\schmi z}}_{j},$$


(32)



where $a^{(s,t)}_{ij}$ denotes the similarity associated with the sample pair ${\schmi{\schmi x}}_i^{(s)}$ and ${\schmi{\schmi x}}_j^{(t)}$ as necessitated by the employed SDR proximity model. Furthermore, we can impose different weighted versions of (32), such as


$${\rm tr}[{\bf Z}^T{\bf A}{\bf Z}]= \sum_{i,j=1}^n {1\over m_im_j} \left(\sum_{s=1}^{m_i}\sum_{t=1}^{m_j}a^{(s,t)}_{ij}\right){\schmi{\schmi z}}_{i}^T{\schmi{\schmi z}}_{j}.$$


(33)



Hence, the duplication procedure results in a proximity matrix $\bf A$ with elements $a_{ij}= {1\over m_im_j} \sum_{s=1}^{m_i}\sum_{t=1}^{m_j}a^{(s,t)}_{ij}$ , where the confidence value ${1\over m_im_j}$ for each sample pair implies that the more classes the samples belong to, the more overlap between classes they induce, and, consequently, the smaller their contribution should be in computing the discriminating embeddings.
Now, we explain how to calculate the constraint matrix. As discussed in Section 3.1, various types of this matrix are used for single-label SDR. The orthogonality constraint on the projections ( ${\bf B}_p={\bf I}_{d\times d}$ ) remains the same in the MESD version of the algorithm. The orthogonal constraint on the embeddings ${\bf B}={\bf I}_{n\times n}$ becomes ${\bf B}={\rm diag}([m_1,\;m_2,\ldots,\;m_n])$ after duplication. When the constraint matrix is a dissimilarity matrix based on scatter information, $\bf B$ (or ${\bf B}_p$ ) can be computed by exactly the same inductions used above to obtain $\bf A$ .
To further elucidate the proposed MESD framework, we provide the MESD equivalent trace template formulations of some of the single-label SDR algorithms from Section 2.2.1 in the top section of Table 2 , where the unweighted objective function in (32) is used. MESD based on (33) can be obtained in the same way as in Table 2 , but modifying all the $\bf H$ matrices to $({\bf Y}{\bf Y}^{T})\oslash ({\bf Y}{\bf 1}_{c\times c}{\bf Y}^{T})\circ {\bf H}$ . These formulations show that MESD actually encourages pairwise points possessing high/low similarities to be close/far in the embedded space, with the similarities evaluated by measures controlled by the targeted single-label method and the sample duplication rules.

Table 2. Proposed Dimensionality Reduction Algorithms Expressed in Terms of the Templates of (30), (31)



W is the feature-based weight matrix, G is the label-based matrix, and $\oslash$ denotes Hadamard division.
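As a concrete instance of Section 3.2 (a sketch of ours), assume a base single-label method that assigns weight 1 to two copies sharing a class and 0 otherwise, and that every sample has at least one label; the weighted proximity of (33) and the duplication-adjusted constraint matrix then become:

import numpy as np

def mesd_matrices(Y):
    # Y: n x c binary label matrix; m_i = number of classes of sample i
    m = Y.sum(axis=1).astype(float)
    A = (Y @ Y.T) / np.outer(m, m)        # A_ij = (y_i . y_j)/(m_i m_j), i.e., (Y Y^T) ./ (Y 1 Y^T)
    B = np.diag(m)                        # orthogonality constraint adjusted for duplication
    return A, B

These matrices can then be passed directly to a solver for the trace templates of Section 3.1.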

3.3 Multioutput Proximity-Based Embeddings (MOPE)
A more flexible and generic mechanism we can follow is to design a multilabel SDR algorithm by directly designing an appropriate proximity matrix ${\bf A}=[a_{ij}]$ , with the orthogonality condition imposed on the embeddings (or projections) to comply with the templates of Section 3.1. This gives rise to the proposed MOPE framework, which encourages samples possessing nonzero $a_{ij}$ to be close in the embedded space, according to the degree of "closeness" quantified by $a_{ij}$ . As was previously discussed, different SDR algorithms mainly differ in the adopted proximity and constraint matrices. Some methods (e.g., FDA, MFA) use the constraint matrix $\bf B$ to process interclass information, while others (e.g., MMC, OLPP-R) incorporate such information within $\bf A$ and maintain $\bf B$ for the scale constraints.
Many SDR algorithms from the global label-based family (see Section 2.5) compute $\bf A$ only from the labels. It can be observed that many of these algorithms consider ${\bf Y}{\bf Y}^{T}$ to be of importance in order to construct the label-based proximity structure. ${\bf Y}{\bf Y}^{T}$ indicates whether two samples belong to the same class for single-label data, or the number of classes two samples simultaneously belong to for multilabel data. Another important source of information is ${\bf Y}^{T}{\bf Y}$ , typically used to scale the label matrix. It can also be viewed as a similarity matrix between classes, representing information of class size and interclass overlapping. We can also observe that SDR algorithms from the global & local family compute $\bf A$ from both features and labels. Some of these algorithms calculate separately a feature-based weight matrix $\bf W$ (as in Section 2.1.1) and a label-based proximity matrix $\bf G$ , and then define $\bf A$ as a combination of $\bf W$ and $\bf G$ . Examples of this include: ${\bf G}\circ {\bf W}$ for LFDA, $\beta {\bf G}+(1-\beta ){\bf W}$ for SLVM, and ${\bf G}_1+\beta {\bf G}_2\circ {\bf W}$ for OLPP-R, which combines different versions of $\bf G$.
Motivated by these commonalities, we base MOPE on a generalized calculation of the feature and label-based proximity matrices $\bf W$ and $\bf G$ , respectively, followed by a merging function ${\schmi \Psi}$ designed to combine these two types of information into the matrix ${\bf A} ={\schmi \Psi} ({\bf W},{\bf G})$ encoding the objective of the optimization template. In the following text, we propose (Section 3.3.1) a broad set of guidelines and schemes to instantiate this framework in terms of computing $\bf G$ , and we develop (Section 3.3.2) an adaptive mechanism for priority-based feature and label information merging.


3.3.1 Label-Based Proximity Matrix $\bf G$
In order to evaluate the similarity between label vectors, we propose three different schemes to compute the proximity matrix $\bf G$ by: 1) working in the binary label space $\{0,1\}^c$ , 2) working in a transformed real label space, and 3) utilizing class similarity information.

Scheme 1. Using a bit-string-based similarity to capture the proximity structure between label vectors ${\schmi{\schmi y}}_i$ is the most direct way to build $\bf G$ . Examples include the MDDM (see Table 1 ) of which one configuration employs an and-based similarity (i.e., the number of the common sample labels) for centered samples, and HSL which scales the and-based similarity by class membership (this gives a cosine norm on binary vectors) and weighs it with class sizes. In the following, we provide alternative ways to measure similarities:

    • Sørensen's similarity coefficient, also known as Dice's coefficient:



    $${\rm sim}_{S}({\schmi{\schmi y}}_i,{\schmi{\schmi y}}_j)={2\Vert {\schmi{\schmi y}}_i\wedge {\schmi{\schmi y}}_j\Vert_{1}\over \Vert {\schmi{\schmi y}}_i\Vert_{1}+\Vert {\schmi{\schmi y}}_j\Vert_{1}} .$$


    (34)



    This is equivalent to scaling the and-based similarity with ${2\over \Vert {\schmi{\schmi y}}_i\Vert_{1}+\Vert {\schmi{\schmi y}}_j\Vert_{1}}$ . We can also define a version of a scaled Sørensen's coefficient, obtained by weighting with the class sizes $\{n_i\}_{i=1}^c$ , as



    $${\rm sim}_{SS}({\schmi{\schmi y}}_i,{\schmi{\schmi y}}_j)= {2\Vert ({\schmi{\schmi y}}_i\wedge {\schmi{\schmi y}}_j)\oslash {\bf s}\Vert_1\over \Vert {\schmi{\schmi y}}_i\Vert_{1}+\Vert {\schmi{\schmi y}}_j\Vert_{1}},$$


    (35)



    where ${\bf s}=[n_1,\;\ldots,\;n_c]^T={\rm diag}({{\bf Y}^T{\bf Y}})$ .

    • Jaccard similarity coefficient



    $${\rm sim}_{J}({\schmi{\schmi y}}_i,{\schmi{\schmi y}}_j)={\Vert {\schmi{\schmi y}}_i\wedge {\schmi{\schmi y}}_j\Vert_{1}\over \Vert {\schmi{\schmi y}}_i\vee {\schmi{\schmi y}}_j\Vert_{1}} .$$


    (36)



    This can be viewed as the number of common classes scaled by the inverse of the total number of different classes ${\schmi{\schmi y}}_i$ and ${\schmi{\schmi y}}_j$ belong to.

    • Hamming-based similarities



    $${\rm sim}_{H}({\schmi{\schmi y}}_i,{\schmi{\schmi y}}_j)= \exp \left({-\Vert {\schmi{\schmi y}}_i \oplus {\schmi{\schmi y}}_j\Vert_{1}\over \tau } \right) {\rm or}$$


    (37)





    $${\rm sim}_{H}({\schmi{\schmi y}}_i,{\schmi{\schmi y}}_j)= 1 - {\Vert {\schmi{\schmi y}}_i \oplus {\schmi{\schmi y}}_j\Vert_{1}\over c} .$$


    (38)



    While the previous and-based indices rely on the importance of the number of "shared" classes with different scalings, a similarity index based on the Hamming distance relies on the number of "distinct" classes.
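A short vectorized sketch of the Scheme 1 similarities above (ours; it assumes binary labels and that no two compared label vectors are both empty):

import numpy as np

def scheme1_similarity(Y, kind="sorensen", tau=1.0):
    # Y: n x c binary label matrix; returns the n x n label-based proximity matrix G
    Y = Y.astype(float)
    inter = Y @ Y.T                                   # |y_i AND y_j| (shared classes)
    sizes = Y.sum(axis=1)
    if kind == "sorensen":                            # eq. (34)
        return 2.0 * inter / (sizes[:, None] + sizes[None, :])
    if kind == "jaccard":                             # eq. (36)
        return inter / (sizes[:, None] + sizes[None, :] - inter)
    if kind == "hamming":                             # eq. (37)
        xor = sizes[:, None] + sizes[None, :] - 2.0 * inter
        return np.exp(-xor / tau)
    raise ValueError(kind)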

Scheme 2. We can also seek the latent similarity between binary label vectors in a transformed and more compact real space. In the first stage, we map each $c$ -dimensional binary label vector ${\schmi{\schmi y}}_i$ to a $k_{c}$ -dimensional real space ( $k_{c}\le c$ ) and obtain a set of transformed label vectors $\{\phi ({\schmi{\schmi y}}_{i})\}_{i=1}^n$ . Of the many ways for achieving this, one is to employ a projection technique that maximizes the variance of the projections $\phi ({\schmi{\schmi y}}_{i})= {\bf P}_y^{T}{\schmi{\schmi y}}_i$ as



$$\max_{{{{{{\bf P}_y\in R^{c\times k_c},\atop {\bf P}_y^T{\bf P}_y={\bf I}_{k_c\times k_c}}}}}} \sum_{i=1}^{n}\left\Vert {\bf P}_y^{T}{\schmi{\schmi y}}_i-{1\over n} \sum_{j=1}^n{\bf P}_y^{T}{\schmi{\schmi y}}_j\right\Vert_2^2.$$


(39)



This is actually PCA in the binary label space, mapping the $c$ -dimensional label vectors into a more compact number of uncorrelated directions. The mappings are generated by the top $k_{c}$ right singular vectors of $({\bf I}_{n\times n}-{1\over n}{\bf 1}_{n\times n}){\bf Y}$ . Other ways to generate the $\phi ({\schmi{\schmi y}}_{i})$ would be to use any of the UDR algorithms described in Section 2.1 or their corresponding kernel versions with $\bf Y$ as the input matrix.

In the second stage of Scheme 2, the following similarity measures can be employed to capture the label-based proximities between the samples $\phi ({\schmi{\schmi y}}_{i})$ :

    • Minkowski-based similarity



    $${\rm sim}_{M}({\schmi{\schmi y}}_i,{\schmi{\schmi y}}_j)=\exp \left({-\Vert \phi ({\schmi{\schmi y}}_{i})-\phi ({\schmi{\schmi y}}_{j})\Vert_{P}^P\over \tau } \right).$$


    (40)



    This is computed from the Minkowski distance passed through a Gaussian to map to a similarity value.

    • Tanimoto similarity coefficient



    $${\rm sim}_{T}({\schmi{\schmi y}}_i,{\schmi{\schmi y}}_j)= {\phi^{T}({\schmi{\schmi y}}_{i})\phi ({\schmi{\schmi y}}_{j})\over \Vert \phi ({\schmi{\schmi y}}_{i}) \Vert_{2}^{2} + \Vert \phi ({\schmi{\schmi y}}_{j})\Vert_{2}^{2}-\phi^{T}({\schmi{\schmi y}}_{i})\phi ({\schmi{\schmi y}}_{j})} .$$


    (41)



    This is an extended cosine similarity (becoming the Jaccard coefficient when applied to binary vectors).

    • Existing feature-based similarity approaches: Any of the techniques from Section 2.1.1 used to compute the weights $\bf W$ can also be employed here.

Scheme 3. In this scheme, we consider the strong degree of dependence the classes $\{{\bf C}_i\}_{i=1}^c$ may possess due to many samples belonging to multiple classes. A flexible way of taking this into account is to define the similarity between any two label vectors as



$${\rm sim}_{S3}({\schmi{\schmi y}}_i,{\schmi{\schmi y}}_j)={1\over m_im_j} \sum_{s=1}^c\sum_{t=1}^cy_{is}y_{jt}\sigma_{st},$$


(42)



or equivalently in matrix form ${\bf G}={\bf D}({\bf Y})^{-1}{\bf Y}\Sigma {\bf Y}^T {\bf D}({\bf Y})^{-1}$ (see also Table 2 ). Equation (42) averages the similarities of pairs of classes either the $i$ th or the $j$ th sample belong to. ${\bf \Sigma} =[\sigma_{ij}]$ is the $c\times c$ similarity matrix between classes and can be constructed via different mechanisms directly from the label matrix $\bf Y$ . The most natural is ${\bf \Sigma} ={\bf Y}^T{\bf Y}$ , where $\sigma_{ij}$ is the number of samples shared by the $i$ th and $j$ th classes. Other more sophisticated ways to compute ${\bf \Sigma}$ can be implemented. For example, by representing each class ${\bf C}_i$ with an $n$ -dimensional binary vector $[y_{1i},\;y_{2i},\;\ldots,\;y_{ni}]^T$ , all similarity indices presented in Schemes 1 and 2 can be used to obtain ${\bf \Sigma}$ , but with ${\bf Y}^T$ used as the input instead. For some domain-specific classification problems, one can also compute ${\bf \Sigma}$ from a feature space where each class is directly representable with a set of domain-specific features. For example, in document classification each document is represented using different words with features being the number of times each word occurs in each document. The task is to assign the document to different categories, such as sports, politics, and science. In this case, one can represent each category also with subsets of words, and class features can either be the number of times the word occurs in those documents belonging to a specific category, given as ${\bf Y}^T{\bf X}$ , or binary indicators showing whether the word occurs in those documents belonging to that specific category. Subsequently, ${\bf \Sigma}$ can be computed from these word features using the techniques from Section 2.1.1.
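A minimal sketch of Scheme 3 in matrix form (ours), with the simplest choice ${\bf \Sigma}={\bf Y}^T{\bf Y}$ used as the default; any other $c\times c$ class-similarity matrix can be substituted.

import numpy as np

def scheme3_similarity(Y, Sigma=None):
    # G = D(Y)^{-1} Y Sigma Y^T D(Y)^{-1}, the matrix form of (42);
    # assumes every sample belongs to at least one class
    Y = Y.astype(float)
    if Sigma is None:
        Sigma = Y.T @ Y                               # sigma_st = number of samples shared by classes s, t
    Dinv = np.diag(1.0 / Y.sum(axis=1))               # D(Y)^{-1} with entries 1/m_i
    return Dinv @ Y @ Sigma @ Y.T @ Dinv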

Discussion. The proximity matrices obtained from the three proposed schemes are given at the bottom of Table 2 . Scheme 1 directly computes the similarities from the original label vectors using a string-based measure, while Scheme 2 computes them from a set of projected label vectors using a real vector-based similarity estimation. Within each scheme, different measures would likely lead to $\bf G$ matrices with similar overall structures, but with different element values reflecting the type and degree of "closeness" between the embedded points. It should be mentioned that when the problem at hand has a large number of classes, such as text categorization with large taxonomies [ 70 ], the label matrix $\bf Y$ is usually very sparse due to the lack of training samples for some classes. In this case, Scheme 2 is preferred over Scheme 1, as the projected label vectors provide a more compact, simplified, and robust representation with reduced noise. Compared to Schemes 1 and 2, Scheme 3 is more domain oriented because it incorporates class dependence, and so it potentially models the label-based proximity more precisely, which could be beneficial in applications where class memberships are not independent.



3.3.2 A Priority-Based Combination Mechanism ${\bf\Psi}$

Ideally, if the features could accurately describe all the discriminating characteristics, the proximity structures computed from features and labels would be very similar. However, when processing real data sets, certain characteristics, such as distributions that are not easily separable or cases where patterns closely/distantly located in the feature space are from different/same classes, may lead to incompatible proximities in the feature (Section 2.1.1) and label (Section 3.3.1) spaces. This may in turn increase classification errors.

Given two proximity matrices ${\bf W}=[w_{ij}]$ and ${\bf G}=[g_{ij}]$ representing the two types of proximity information sources, most existing algorithms (e.g., SLVM and OLPP-R, discussed at the beginning of Section 3.3), although they allow the user to control the degree of preference with a weight parameter, assume that the two types are equally important. However, given a classification task, the label information of the training samples constitutes the more dominant and driving force of the problem, while the feature information is usually more susceptible to noise and to imperfections of the employed feature extraction or preprocessing stages. Therefore, to equip the proposed MOPE framework with a more robust information combining mechanism for calculating ${\bf A}=\Psi ({\bf W},{\bf G})$, we treat the feature-based proximity as a secondary source of information and the label-based proximity as the primary one. To implement this setup we propose the combining function:



$$\psi_{ij} = {g_{ij}^a\over 1+\beta \big(1-w_{ij}^b\big)},$$


(43)



where $\psi_{ij}$ is the combined similarity between two data points, and the two information sources are assumed (or can be rescaled) to satisfy $0\le w_{ij}, g_{ij}\le 1$. The function is monotonically increasing in both $w_{ij}$ and $g_{ij}$ and is controlled by three parameters: $\beta \ge 0$, which sets the lower bound, and $a,\;b\ge 0$, which control the rate of interaction of $w_{ij}$ and $g_{ij}$. Equation (43) is designed so that the contribution of $w_{ij}$ is primarily restricted by $g_{ij}$: $w_{ij}$ on its own cannot increase $\psi_{ij}$ much unless $g_{ij}$ is adequately high to support it. In this way, $\psi_{ij}$ gives priority to $g_{ij}$ over $w_{ij}$. The overall behavior of (43) is seen from



$$\eqalign{ \lim_{g_{ij}\rightarrow 0}\psi_{ij}&=0,\cr \lim_{g_{ij}\rightarrow 1}\psi_{ij}&={1\over 1+\beta \big(1-w_{ij}^b\big)} =\left\{ \matrix{{1\over 1+\beta } & {\rm if} \; w_{ij}\rightarrow 0,\cr 1 &{\rm if} \; w_{ij}\rightarrow 1,} \right.\cr \lim_{w_{ij}\rightarrow 0}\psi_{ij}&= {g_{ij}^a\over 1+\beta },\cr \lim_{w_{ij}\rightarrow 1}\psi_{ij}&= g_{ij}^a, }$$


and also from the example of Fig. 1 . The hyperparameters $a$ , $b$ , and $\beta$ can be set by the user or tuned by cross-validation to adjust the interaction of the proximity information to the problem at hand. When $\beta =0$ , the proximity matrix is computed only from the class information, as in CCA, PLS, MDDM, HSL, MMC, etc. Similar mechanisms for combining sources of information with different priorities have also been used in [ 71 ], in a different context. More powerful combining functions could also be designed to generalize or extend (43), e.g., $\psi_{ij} = {\gamma g_{ij}^a\over 1+\beta (1-w_{ij}^b)} +(1-\gamma )w_{ij}^b$ with $0\le \gamma \le 1$ , which supports more options beyond the priority-based mechanism, such as $\bf W$ on its own ( $\gamma =0$ and $b=1$ ) or a weighted sum of $\bf W$ and $\bf G$ ( $\beta =0$ , $a=1$ , and $b=1$ ).
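Since (43) acts elementwise, it can be applied directly to the full proximity matrices once both are scaled to $[0,1]$; the sketch below is a minimal NumPy version, with default hyperparameter values borrowed from the example of Fig. 1:

```python
import numpy as np

def combine_priority(G, W, a=0.3, b=1.3, beta=1.0):
    """Eq. (43): priority-based combination of the label-based proximities G
    (primary source) and the feature-based proximities W (secondary source).
    Both inputs are assumed to be scaled to [0, 1]; works on scalars or arrays."""
    return G ** a / (1.0 + beta * (1.0 - W ** b))
```

For instance, `combine_priority(0.0, 1.0)` returns 0 and `combine_priority(1.0, 1.0)` returns 1, matching the limiting behavior listed above.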





Fig. 1. Example plot of the combined proximity $\psi_{ij}$ as a function of the feature-based $w_{ij}$ and the label-based weight $g_{ij}$ according to (43) with $a=0.3$ , $b=1.3$ , and $\beta =1.0$ .







We calculate $\bf W$ as in Section 2.1.1 for all sample pairs and $\bf G$ using a method from Schemes 1-3; the combined similarities $\{\psi_{ij}\}_{i,j=1}^{n}$ are then obtained as described above. Finally, we define ${\bf A} = {\bf\Psi} ({\bf W},{\bf G})$ via a $K$-NN search as in Section 2.1.1, with either constant values or the $\psi_{ij}$ similarities as edge weights.
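A minimal sketch of this final graph-construction step, assuming the combined similarities $\psi_{ij}$ are available as a dense matrix (the symmetrization by elementwise maximum is our own illustrative choice):

```python
import numpy as np

def knn_graph(Psi, K, binary_weights=False):
    """Build the weight matrix A from the combined similarities Psi via a
    K-NN search: each sample keeps its K most similar neighbors, with edge
    weights set either to the psi values or to constant 1."""
    n = Psi.shape[0]
    A = np.zeros_like(Psi)
    for i in range(n):
        sims = Psi[i].copy()
        sims[i] = -np.inf                         # exclude self-similarity
        nn = np.argsort(sims)[-K:]                # indices of the K nearest neighbors
        A[i, nn] = 1.0 if binary_weights else Psi[i, nn]
    return np.maximum(A, A.T)                     # symmetrize the graph
```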

3.4 Semi-Supervised Embeddings
Motivated by various existing SSDR methods, such as SSMMC, SSFDA, and SELF described in Section 2.3, we also propose a general framework for obtaining semi-supervised embeddings by combining any SDR and UDR models and employing the optimization template of (31). The objective encoding matrix can be generally formulated as the weighted sum


$${\bf A}_p = \lambda {\bf X}^{T}{\bf A}^{{\rm (UDR)}}{\bf X}+(1-\lambda ){\bf X}_{l}^{T}{\bf A}^{{\rm (SDR)}}{\bf X}_{l},$$


(44)



where ${\bf A}^{{\rm (SDR)}}$ denotes the proximity matrix of the selected SDR method computed from the labeled data ${\bf Y}_{l}$ and/or ${\bf X}_{l}$ , and ${\bf A}^{{\rm (UDR)}}$ the proximity matrix of the selected UDR method computed from all (both labeled and unlabeled) data $\bf X$ .
With respect to the constraint matrix ${\bf B}_p$, there are two options, the simplest being to impose orthogonality on the projections by setting ${\bf B}_p={\bf I}_{d\times d}$. Depending on user preference, however, another possibility is to use a combined matrix analogous to the above:


$${\bf B}_{p} = \lambda {\bf X}^{T}{\bf B}^{{\rm (UDR)}}{\bf X}+(1-\lambda ){\bf X}_{l}^{T}{\bf B}^{{\rm (SDR)}}{\bf X}_{l},$$


(45)



where, similarly to (44), ${\bf B}^{{\rm (SDR)}}$ is computed from ${\bf Y}_{l}$ and/or ${\bf X}_{l}$, while ${\bf B}^{{\rm (UDR)}}$ is computed from the entire $\bf X$. As an example, the semi-supervised method proposed in [ 32 ] is a special case of this framework: it uses (31) as the objective, with ${\bf A}_p$ defined as in (44), but with the constraint matrix obtained by setting $\lambda =0$ in (45), i.e., ${\bf B}_{p} ={\bf X}_{l}^{T}{\bf B}^{{\rm (SDR)}}{\bf X}_{l}$.
Note that, in the proposed framework, if the recruited SDR model is applicable to multilabel data, then so is the resulting SSDR algorithm. Table 2 also includes this general expression. The proposed SSDR framework preserves the feature-based proximity structure of all the available samples in the embedded space, while simultaneously preserving the label-based (and feature-based) proximity structure of the labeled samples.
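A minimal sketch of (44); the constraint matrix of (45) is formed analogously by substituting ${\bf B}^{\rm (UDR)}$ and ${\bf B}^{\rm (SDR)}$. Dense NumPy matrices are assumed, and the function is illustrative rather than the authors' implementation:

```python
import numpy as np

def ssdr_encoding(X, X_l, A_udr, A_sdr, lam):
    """Eq. (44): objective encoding matrix of the proposed SSDR framework.

    X     : (n, d) feature matrix of all (labeled and unlabeled) samples.
    X_l   : (n_l, d) feature matrix of the labeled samples only.
    A_udr : (n, n) proximity matrix of the chosen UDR method.
    A_sdr : (n_l, n_l) proximity matrix of the chosen SDR method.
    lam   : tradeoff parameter lambda in [0, 1].
    Returns the (d, d) matrix A_p.
    """
    return lam * (X.T @ A_udr @ X) + (1.0 - lam) * (X_l.T @ A_sdr @ X_l)
```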
3.5 Relation-Based Embeddings
For SDR and SSDR tasks, out-of-sample extensions are needed to generate embeddings for new query points. Therefore, projection-based embeddings, as in (31), are preferred, but in such cases kernel-based versions are needed to increase the power of the model [ 11 ], [ 23 ], [ 64 ]. We can apply the standard kernel trick [ 72 ] to the formulation of (30) through a kernel function ${\cal K}(\cdot,\;\cdot )$, which defines the dot product in a high-dimensional (possibly infinite-dimensional) feature space known as the kernel-induced space [ 73 ], [ 74 ]. Letting ${\phi}_{\kappa }:{\cal R}^{d}\mapsto {\cal H}$ denote the mapping that transforms the original data to the kernel-induced space ${\cal H}$, then ${\bf\Phi}_{\kappa }=[{\phi}_{\kappa }({\bf x}_1),\;\ldots,\;{\phi}_{\kappa }({\bf x}_n)]^T$ is the feature matrix in ${\cal H}$. Further, if ${\bf K}=[k_{ij}]$ denotes the $n \times n$ kernel matrix between the $n$ data points, then $k_{ij}={\cal K}({\bf x}_i,{\bf x}_j)={\phi}_{\kappa}^T({\bf x}_i){\phi}_{\kappa}({\bf x}_j)$. Working in ${\cal H}$, we seek a transformation matrix $\bar{{\bf P}}=[\bar{{\bf p}}_1,\;\ldots,\;\bar{{\bf p}}_k]$ to project the kernel-induced features to a $k$-dimensional subspace, where the embeddings are computed as ${\bf Z}={\bf\Phi}_{\kappa }\bar{{\bf P}}$. The classical kernel trick expands each of the transformation vectors $\{\bar{{\bf p}}_j\}_{j=1}^k$ as a linear combination of the training samples in ${\cal H}$, given as $\bar{{\bf p}}_j=\sum_{i=1}^n\gamma_{ij}{\phi}_{\kappa }({\bf x}_i)$. Letting ${\bf \Gamma} =[\gamma_{ij}]$ denote the $n\times k$ coefficient matrix, we have ${\bf Z}={\bf \Phi}_{\kappa }\bar{{\bf P}}={\bf \Phi}_{\kappa }{\bf \Phi}_{\kappa }^T{\bf\Gamma} ={\bf K}{\bf \Gamma}$. By incorporating this into (30), the optimal coefficients can be obtained by solving


$$\max_{{\bf\Gamma} \in {\cal R}^{n\times k},\; {\bf\Gamma}^T{\bf B}_{{\bf\Gamma}}{\bf\Gamma} ={\bf I}_{k\times k}}{\rm tr}\big[{\bf\Gamma}^T{\bf A}_{{\bf\Gamma}}{\bf\Gamma}\big],$$


(46)



where ${\bf A}_{\bf\Gamma }={\bf K}^{T}{\bf A}{\bf K}$ and ${\bf B}_{\bf\Gamma }={\bf K}^{T}{\bf B}{\bf K}$. If the orthogonality condition is imposed directly on the coefficients so that ${\bf \Gamma}^T{\bf \Gamma} ={\bf I}_{k \times k}$, we can set ${\bf B}_{\bf\Gamma }={\bf I}_{n \times n}$. The embeddings for $m$ query samples can then be computed by $\tilde{{\bf Z}}=\tilde{{\bf K}}{\bf\Gamma}$, where $\tilde{{\bf K}}$ denotes the $m\times n$ kernel matrix between the queries and the training samples.
From (46) and (31), we observe that the kernel-based projection works in a similar way to the one performed in the original space, but with $\bf K$ used as the input feature matrix instead of $\bf X$, and ${\bf\Gamma}$ used as the projection matrix instead of $\bf P$. Since the kernel matrix can be viewed as a similarity structure between samples, kernel-based projection is equivalent to using a set of similarity values to the training samples as the input features for computing embeddings. Therefore, we can drop the positive semidefiniteness requirement on the kernel function and use an arbitrary set of (dis)similarity values to the training samples (the relation features), calculated, for instance, using the measures listed in Section 2.1.1, as the input features.
By employing the relation features as the input of the dimensionality reduction algorithm, one can not only discover the nonlinear structure of the data, but also reduce the computational complexity for data sets with large-scale features ( $d\gg n$ ). Since the eigen-decomposition of an $n\times n$ matrix is required instead of a $d\times d$ one, the cost is reduced to $O(n^3)$ ( $n \ll d$ ) [ 75 ]. It is also possible to select a collection of $p$ ( $p\le n$ ) seed prototypes instead of using all the training samples, in order to compute $p$ relation features. In this case, the computational cost can be further reduced to $O(p^3)$ ( $p\le n \ll d$ ). Since previous research [ 76 ], [ 77 ] has shown that (dis)similarity values can be used as input features to build good classifiers, we can expect the discriminating ability of the embeddings computed from these relation features to be similar to that of the original features.
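A sketch of how the coefficient matrix of (46) can be computed in practice, assuming symmetric ${\bf A}$ and ${\bf B}$ and an $n\times n$ relation (or kernel) matrix ${\bf K}$; the small ridge added to ${\bf B}_{\bf\Gamma}$ for numerical stability is our own choice, not part of the formulation:

```python
import numpy as np
from scipy.linalg import eigh

def relation_embedding(K, A, B, k, reg=1e-6):
    """Eq. (46): solve for the n x k coefficient matrix Gamma given the
    relation matrix K and the encoding/constraint matrices A and B.

    Training embeddings are Z = K @ Gamma; query embeddings are
    Z_query = K_query @ Gamma, with K_query the (m, n) relation matrix
    between the queries and the training samples."""
    A_G = K.T @ A @ K
    B_G = K.T @ B @ K + reg * np.eye(K.shape[0])   # ridge keeps B_G positive definite
    # Generalized symmetric eigenproblem; eigh returns eigenvalues in ascending order.
    _, V = eigh((A_G + A_G.T) / 2.0, (B_G + B_G.T) / 2.0)
    return V[:, ::-1][:, :k]                        # top-k generalized eigenvectors
```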
4. Experiments
4.1 Used Data Sets
In order to examine the performance of our proposed frameworks for SDR/SSDR, two multilabel text categorization problems with large-scale features are studied.

    Reuters documents. The "Reuters-21578 Text Categorization Test Collection" contains articles taken from the Reuters newswire, 1 where each article is assigned to one or more semantic categories. A total of 9,980 articles from 10 overlapping categories were used in our experiments, with each category containing between 400 and 4,000 articles. We randomly divided the articles from each category into three partitions of nearly equal size, for the purposes of training, validation, and testing. This leads to 3,328 articles for training and 3,326 articles each for validation and test. Around 18 percent of these articles belong to two to four different categories simultaneously, while each of the remaining articles belongs to a single category.

    Education evidence portal (EEP) documents. A collection of documents supplied by the EEP, 2 where each document is a fairly lengthy full paper or report (approximately 250 KB on average after conversion to plain text). Domain experts have developed a taxonomy of 107 concept categories in the area and manually assigned categories to the documents stored in the database. This manual effort has resulted in 2,149 documents, comprising 1,928 training documents and 221 test documents, most with multiple categories assigned. Among these 107 categories, the five largest contain more than 500 documents each, while most of the remaining ones contain between 1 and 300 documents. Around 96 percent of these documents are assigned 2 to 17 different categories, while each of the remaining documents belongs to only one category.

The numerical features for classification were extracted as follows: We first applied Porter's stemmer 3 to the documents, then extracted word unigrams from each document. For the Reuters documents, after filtering the low-frequency words, the term frequency-inverse document frequency (tf-idf) values of 24,012 word unigrams are used as the original features. This leads to a $3{,}328\times 24{,}012$ feature matrix $\bf X$ for the training samples and a $3{,}326\times 24{,}012$ feature matrix $\tilde{{\bf X}}$ for the query samples, in both the validation and test procedures. For the EEP documents, the corresponding frequencies of the word unigrams, representing the number of times the terms occur in the documents, are used as the original features. This leads to a $1{,}928\times 1{,}721{,}881$ feature matrix $\bf X$ for the training samples and a $221\times 1{,}721{,}881$ feature matrix $\tilde{{\bf X}}$ for the test samples.
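A rough approximation of this feature-extraction pipeline, sketched with scikit-learn's TfidfVectorizer; the actual experiments used Porter's stemmer and custom unigram counting, and the frequency threshold below is a placeholder rather than the authors' setting:

```python
# Illustrative sketch only: stemming is omitted and min_df is a placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer

def build_tfidf_features(train_docs, query_docs, min_df=5):
    vectorizer = TfidfVectorizer(min_df=min_df)    # drop low-frequency unigrams
    X = vectorizer.fit_transform(train_docs)       # (n_train, vocabulary size)
    X_query = vectorizer.transform(query_docs)     # (n_query, vocabulary size)
    return X, X_query
```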
4.2 Experimental Setup
In the experiments, we aim to compare the existing and proposed dimensionality reduction methods under the same conditions. 4 The discriminating ability of the same number of embeddings, computed from the same set of relation features, was evaluated using the same classifier. Two sets of experiments were conducted:

    Experiment 1 was conducted for SDR. The proposed MESD extension was applied to three single-label SDR methods: FDA, LFDA, and OLPP-R. The three schemes for computing the label-based proximity matrix $\bf G$ of MOPE were also compared. The proposed combination mechanism in (43) was compared with the Hadamard multiplication ${\bf G}\circ {\bf W}$ used by LFDA and the weighted sum $\beta {\bf G}+(1-\beta ){\bf W}$ used by SLVM, with the same configuration of $\bf G$ and $\bf W$ . The proposed methods were also compared with three existing UDR methods (LSI, PCA, and OLPP) and six existing multilabel SDR methods (MDDM, CCA, the three versions of HSL, and MORP).

    Experiment 2 was conducted for SSDR. Only part of the training samples were used as labeled data and only the resulting embeddings of the labeled data were used to train the classifier. We combined OLPP and the best performing MOPE instance from experiment 1, using the proposed SSDR framework, leading to semi-supervised MOPE (SSMOPE). The results were compared with those obtained by solely using OLPP or MOPE, as well as the existing semi-supervised multilabel algorithm ML-LGC.

In both experiments, relation features computed with the Euclidean distance were used for the Reuters documents, while cosine similarity was used for the EEP documents. In experiment 1, $k= 1{,}800$ and $k=500$ embeddings were computed for the Reuters and EEP documents, respectively, while, correspondingly, $k= 800$ and $k=300$ were used for experiment 2. These values of $k$ resulted from a rough grid search with steps of 100, based on cross validation. To obtain $\bf W$ , the Gaussian kernel was used for the Reuters documents, and cosine similarity for the EEP. To obtain $\bf G$ for MOPE, we implemented Scheme 1 with Sørensen's, scaled Sørensen's, Jaccard, and Hamming-based similarities, Scheme 2 with cosine similarities between latent label vectors projected by PCA, and Scheme 3 with ${\bf \Sigma}$ computed by applying Sørensen's similarity coefficient to ${\bf Y}^T$ . The number $K$ of $K$ -NNs used for graph construction, the regularization and balancing (or tradeoff) parameters of the different existing algorithms and of the SSDR framework, as well as the parameters $a$ , $b$ , and $\beta$ of our priority-based function ${\bf\Psi}$ , were tuned on the validation set for the Reuters documents, and by 3-fold cross-validation on the training set for the EEP documents. For each class and all embedding methods, a binary classifier was trained using Fisher's linear discriminant analysis (FLDA). The macro $F_{1}$ score was employed to evaluate the classification performance. This is the mean of the $F_1$ scores over all classes, where each class's score is evaluated as $F_{1}={2\,{\rm Precision}\times {\rm Recall}\over {\rm Precision}+{\rm Recall}}$ , with ${\rm Precision}={{\rm TP}\over {\rm TP}+{\rm FP}}$ and ${\rm Recall}={{\rm TP}\over {\rm TP}+{\rm FN}}$ , where TP denotes true positive, TN true negative, FP false positive, and FN false negative samples. All the experiments were conducted using MATLAB R2011a on a 3.06 GHz CPU with 2 GB memory running Mac OS X.
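For reference, a straightforward sketch of the macro $F_1$ computation described above, applied to binary label matrices (an illustration of the metric, not the authors' evaluation code):

```python
import numpy as np

def macro_f1(Y_true, Y_pred):
    """Macro F1: the mean over classes of 2PR/(P+R), with per-class
    precision P = TP/(TP+FP) and recall R = TP/(TP+FN).

    Y_true, Y_pred : (n, c) binary label matrices."""
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)
    scores = []
    for j in range(Y_true.shape[1]):
        tp = np.sum(Y_pred[:, j] & Y_true[:, j])
        fp = np.sum(Y_pred[:, j] & ~Y_true[:, j])
        fn = np.sum(~Y_pred[:, j] & Y_true[:, j])
        prec = tp / (tp + fp) if tp + fp > 0 else 0.0
        rec = tp / (tp + fn) if tp + fn > 0 else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0)
    return float(np.mean(scores))
```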
4.3 Results


4.3.1 Experiment 1

First, we compare the label-based proximity matrices for MOPE computed with the proposed schemes against the label-based matrices used by CCA and by the two versions of HSL with the highest and lowest classification errors, as well as against the feature-based proximity matrix used by OLPP. Figs. 2 and 3 present a visual qualitative comparison between these matrices and also include their corresponding numerical classification performances. To delineate the potential class structure of these proximity matrices, we have reordered the training samples so that samples with more similar label vectors are grouped together; to do so, we applied the data seriation algorithm coVAT [ 78 ] to the label matrix $\bf Y$ . It can be seen that the feature-based proximity matrix $\bf W$ of the EEP documents in Fig. 3 a does not possess obvious class structure and is incompatible with many of the label-based proximity matrices $\bf G$ in Fig. 3 . On the contrary, the $\bf W$ matrix of the Reuters documents in Fig. 2 a possesses a roughly similar class structure to the $\bf G$ matrices in Fig. 2 , though less distinct. This indicates that the EEP set includes noisier features than the Reuters set. It can also be seen from Figs. 2 and 3 that, compared to the label-based proximity matrices used by CCA and HSL, our proposed schemes generate proximity matrices with sharper class structures and achieve comparatively higher classification performance, especially for the EEP set. Among the three proposed schemes, Scheme 3 works best for both data sets, and it is therefore used to compute the combined proximity matrix for MOPE.



Fig. 2. Comparison of the feature-based and label-based proximity matrices using the Reuters documents. OLPP is based on the feature-based proximity matrix $\bf W$ , while the remaining methods are based on different versions of the label-based proximity matrices $\bf G$ . Classification performance is shown in parentheses in terms of the macro $F_1$ score.











Fig. 3. Comparison of the feature- and label-based proximity matrices using the EEP documents. OLPP is based on the feature-based proximity matrix $\bf W$ , while the remaining methods are based on different versions of the label-based proximity matrices $\bf G$ . Classification performance is shown in parentheses in terms of the macro $F_1$ score.







We also compare in Fig. 4 the performance of various existing and proposed SDR and UDR methods, and of MOPE equipped with the proposed combination mechanism as well as with two existing ones (Hadamard and weighted sum; see Table 3 ). It can be seen that the embeddings obtained by SDR possess better discriminating ability than those obtained by UDR, although the existing SDR method MDDM does not perform well for either data set. All the multilabel SDR algorithms derived from the MESD and MOPE frameworks provide performance similar to or better than the existing methods. The top three best-performing algorithms for both data sets are all generated by MOPE and MESD (see Fig. 4 ). The proposed combination mechanism for MOPE performs better than the Hadamard and weighted-sum ones, which sometimes even reduce the performance obtained by using $\bf G$ or $\bf W$ alone, as seen in Table 3 . The performance improvement achieved by using advanced SDR methods (e.g., HSL, MESD, and MOPE) is higher for the EEP data set (3 to 7 percent) than for Reuters (0.2 to 1.2 percent), as EEP seems to suffer more from noisy features.





Fig. 4. Performance comparison for different embedding methods. Circles represent different versions of: HSL, MOPE with label-based proximity $\bf G$ only (see also the schemes in Figs. 2 and 3 ), and MESD with the extended single-label SDR algorithms. The best performances obtained by the proposed methods are marked in red.







Table 3. Comparison of Different Combination Schemes Using the $F_1$ Score



We also demonstrate the reduction in computational cost achieved by using the relation features instead of the original ones. With methods expressed by (31) and based on linear projections, one needs to decompose or invert a $d\times d$ matrix, where the original feature dimensionality is $d = 1{,}721{,}881$ for the EEP set and $d = 24{,}012$ for the Reuters set; this makes the classification task highly impractical. With the relation features, the computational cost depends only on the number of training samples $n$ . This corresponds to less than 300 seconds for both data sets, and to no more than 2,500 seconds for the comparatively larger Reuters set when a generalized eigen-decomposition is involved.



4.3.2 Experiment 2

The SSDR performance for each data set, averaged over all classes, is shown in Table 4 , where different percentages of the training samples (40 or 80 percent) are used as labeled data. It can be seen that the proposed SSDR framework of Section 3.4 (the instance chosen here, referred to as SSMOPE, combines the UDR method OLPP with an SDR instance derived from the MOPE framework of Section 3.3) outperforms the existing ML-LGC. As the 40 percent rows of the table show, when insufficient labeled data are available, the embeddings computed from both labeled and unlabeled data can possess better discriminating ability than those computed from labeled data only, as in MOPE, or by ignoring the label information altogether, as in OLPP.

Table 4. SSDR Performance in $F_1$ Score for Both Data Sets



5. Conclusion
Our work has focused on SDR and SSDR for multilabel classification. We have summarized many current UDR, SDR, and SSDR techniques and formulated them under a common template representation (Section 3.1 and Table 1 ). Two frameworks have been proposed to achieve multilabel SDR: MESD, which enables the extension of all existing single-label SDR to multilabel (Section 3.2), and MOPE, which is a general multilabel design framework and offers a flexible methodology for constructing the proximity structures between samples based on the feature and label information (Section 3.3). A broad set of different schemes has been provided to discover the label-based proximity structure between samples (Section 3.3.1), as well as a priority-based mechanism to combine label-based and feature-based weight information (Section 3.3.2). A general methodology for achieving SSDR by combining existing UDR and SDR methods has also been provided (Section 3.4). We have also discussed the computation of nonlinear embeddings by using relation features (Section 3.5), and how such processing can reduce the computational demands when using large-scale input features. Experimental results demonstrate the effectiveness of the proposed methodologies by comparing with various existing UDR, SDR, and SSDR methods.

Acknowledgments

This work was funded by the UK Biotechnology and Biological Sciences Research Council (BBSRC project BB/G013160/1 Automated Biological Event Extraction from the Literature for Drug Discovery).

    T. Mu and S. Ananiadou are with the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, MIB Building, 131 Princess Street, Manchester M1 7DN, United Kingdom.

    E-mail: tingtingmu@me.com, sophia.ananiadou@manchester.ac.uk.

    J.Y. Goulermas is with the Department of Electrical Engineering and Electronics, University of Liverpool, Brownlow Hill, Liverpool L69 3GJ, United Kingdom. E-mail: j.y.goulermas@liverpool.ac.uk.

    J. Tsujii is with Microsoft Research Asia, Beijing, China.

    E-mail: jtsujii@microsoft.com.

Manuscript received 18 Oct. 2010; revised 6 Aug. 2011; accepted 25 Dec. 2011; published online 9 Jan. 2012.

Recommended for acceptance by K. Murphy.

For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMI-2010-10-0799.

Digital Object Identifier no. 10.1109/TPAMI.2012.20.

1. http://archive.ics.uci.edu/ml/support/Reuters-21578+Text+Categorization+Collection.

2. http://www.eep.ac.uk.

3. http://tartarus.org/~martin/PorterStemmer/.

4. Our code can be downloaded from http://pcwww.liv.ac.uk/~goulerma/software/mesd-mope.zip.

References





Tingting Mu received the BEng degree in electronic engineering and information science from the University of Science and Technology of China, Hefei, China, in 2004, and the PhD degree in electrical engineering and electronics from the University of Liverpool, Liverpool, United Kingdom, in 2008. She is currently a postdoctoral researcher with the School of Computing, Informatics and Media, University of Bradford, United Kingdom. Her current research interests include machine learning and pattern recognition, with applications to text mining, bioinformatics, biomedical engineering, and intelligent transportation systems. She is a member of the IEEE.





John Yannis Goulermas received the BSc degree (first class) in computation from the University of Manchester Institute of Science and Technology (UMIST), Manchester, United Kingdom, in 1994, and the MSc degree by research and the PhD degree from the Control Systems Center, Department of Electrical and Electronic Engineering, UMIST, in 1996 and 2000, respectively. He is currently a reader in the Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool, United Kingdom. His current research interests include machine learning, mathematical modeling/optimization, and image processing, with application areas including biomedical engineering and industrial monitoring and control. He is a senior member of the IEEE.





Jun'ichi Tsujii is a principal researcher with Microsoft Research Asia, China. Until March 2011, he was a professor of text mining in the School of Computer Science, University of Manchester, and a professor of computer science at the University of Tokyo. He has worked since 1973 in NLP, question answering, text mining, and machine translation. His recent research achievements include deep semantic parsing based on the feature forest model, efficient search algorithms for statistical parsing, improved estimators for the maximum entropy model, and the construction of the GENIA gold-standard corpus for biomedical text mining. He has authored more than 300 papers in journals, conferences, and books. He was the president of the Association for Computational Linguistics (ACL) in 2006 and has been a permanent member of the International Committee on Computational Linguistics (ICCL) since 1992. He received the IBM Science Award (1988), the IBM Faculty Award (2005), and the Medal of Honor with Purple Ribbon from the Government of Japan in 2010.





Sophia Ananiadou is a director of the National Centre for Text Mining (NaCTeM) and a professor in Text Mining in the School of Computer Science, University of Manchester. She is the main designer of the text-mining tools and services currently used in NaCTeM, i.e., terminology management, information extraction, intelligent searching, and association mining. Her research projects include text mining-based visualization of biochemical networks, data integration using text mining, building biolexica and bio-ontologies for gene regulation, automatic event extraction of bioprocesses, as well as EC FP7 project MetaNet4U and research industrially funded by Pfizer/AstraZeneca/IBM/BBC. She has been awarded the Daiwa Adrian prize (2004) and the IBM UIMA innovation award (2006, 2007, 2008) for her leading work on text-mining tools in biomedicine. She has more than 160 publications in journals, conferences, and books.