This task is known as the *bipartite* ranking problem. Here, a training data set drawn from an instance space is given, where the instances come from a set of two categories, positive and negative, represented as {**p**, **n**}. Using this training set, the goal is to learn a ranking function that ranks future positive instances higher than negative ones. In other words, the function is expected to assign higher values to positive instances than to negative ones. The instances can then be ordered using the values provided by the ranking function.

A widely used measure of ranking quality is the area under the ROC curve (AUC) [22]. It has become a popular performance measure in the machine learning community after it was realized that accuracy is often a poor metric for evaluating classifier performance [33], [41], [42].

For a given threshold on the scores, the instances scoring above it are predicted to be in the **p** class and all others are predicted to be in the **n** class. Therefore, for each threshold value, a separate confusion matrix is obtained. The number of confusion matrices is equal to the number of ROC points on an ROC graph. With the method proposed by Domingos [15], it is possible to obtain ROC curves even for algorithms that are unable to produce scores.

Let the instances, each labeled as either **p** or **n**, be ranked by some scoring function. Given a threshold value, instances whose score is below the threshold are predicted as **n**, and those with a higher score are predicted as **p**. For a given threshold value, *TP* is the number of positive instances that have been classified correctly and *FP* is the number of negative instances that have been misclassified.

An ROC graph plots the fraction of positives that are correctly classified (*TPR* = true positive rate) versus the fraction of negatives that are misclassified (*FPR* = false positive rate). The values of *TPR* and *FPR* are calculated using (1), where *N* is the total number of negative instances and *P* is the total number of positive instances

*TPR* = *TP* / *P*,  *FPR* = *FP* / *N* (1)

For each threshold value, an (*FPR*, *TPR*) pair is computed. The ROC space is a 2D space with a range of [0, 1] on both axes. In ROC space, the vertical axis represents the true positive rate of a classification output, while the horizontal axis represents the false positive rate. That is, each (*FPR*, *TPR*) pair corresponds to a point in the ROC space. In a data set with *c* distinct classifier scores, there are *c* + 1 ROC points, including the trivial (0, 0) and (1, 1) points.
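As a concrete illustration of how the (*FPR*, *TPR*) points described above can be generated from classifier scores, the following sketch walks the distinct score values as thresholds (function and variable names are ours, not from the paper):

```python
# Sketch of ROC-point generation: each distinct score acts as a threshold,
# and for every threshold the (FPR, TPR) pair of (1) is computed.

def roc_points(scores, labels):
    """Return the ROC points for binary labels ('p'/'n') and real scores."""
    P = sum(1 for y in labels if y == 'p')   # total positives
    N = sum(1 for y in labels if y == 'n')   # total negatives
    points = [(0.0, 0.0)]                    # trivial point: everything is 'n'
    tp = fp = 0
    # Walk the instances from the highest score down; lowering the threshold
    # past each distinct score adds the instances carrying that score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    i = 0
    while i < len(ranked):
        s = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == s:
            if ranked[i][1] == 'p':
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / N, tp / P))      # (FPR, TPR) for this threshold
    return points

pts = roc_points([0.9, 0.8, 0.8, 0.3], ['p', 'p', 'n', 'n'])
# 3 distinct scores -> 4 ROC points, from (0, 0) to (1, 1)
```

Note that ties in the scores are handled by moving the threshold past all instances with the same score at once, so tied instances never produce separate points.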

The use of the area under the ROC curve (AUC) as a performance measure was proposed in [5], [22]. Efficient algorithms for generating the ROC points and computing the AUC were also given in [22]. The AUC value is closely related to the Wilcoxon statistic [30].

RIMARC first converts continuous features into categorical ones using the MAD2C method [35]. The MAD2C method discretizes a continuous feature into a set of categorical values in such a way that the AUC of the new categorical feature is maximized. At this point, all the features are categorical. Then, the score for a value of a feature is set to the probability that a training instance having that value for the feature has the class label **p**. During the computation of the score values, instances whose value for the feature is missing are simply ignored. For each feature, the values are sorted in increasing order of their score, and the AUC is computed. Finally, the weight of a feature *f* is computed from *AUC*(*f*), the AUC obtained on the feature *f*. That is, the weight of a feature is a linear function of its AUC value, calculated using the training instances whose value for that feature is known. A higher AUC value for a feature is an indication of its higher relevance in determining the class label. For example, if the AUC computed for a feature *f* is 1, then it means that all instances in the training set can be ranked by using only the values of *f*. Hence, we can expect that new query instances can also be ranked correctly by using *f* alone. The training method of the RIMARC algorithm is given in Algorithm 2.

For a query instance, the ranking function estimates the probability that its class label is **p**. This is only a rough estimate of the probability, since it is very likely that no other instance with exactly the same feature values has been observed in the training set. The ranking function of the RIMARC algorithm computes this estimated probability as the weighted average of the probabilities computed on single features, as shown in Algorithm 1.

Algorithm 1. The ranking function.

Algorithm 2. Training in RIMARC.

A similar approach has been used in [24] to create better prediction models for classifiers.

(2)

(3)

where n_p(v) denotes the number of **p**-labeled instances with the value v. Hence,

(4)

(5)

However, for some values of v, the count of instances carrying one of the labels, say **n**, may be 0, since a value may be observed in instances with the label **p** only. In such cases a ranking function defined as a ratio of the two counts will have an undefined value. To avoid such problems, the RIMARC algorithm defines the ranking function as

s(v) = n_p(v) / (n_p(v) + n_n(v)) (6)

where n_n(v) is the number of **n**-labeled instances with the value v. This is the estimated probability of the **p** label among all instances with the value v. Such a probability value is easily interpretable by humans.
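Under this reading of (6), the score of each categorical value can be computed in a few lines; the function below is our sketch, not the authors' code:

```python
# Minimal sketch of the ranking score in (6) for one categorical feature:
# the score of a value v is the fraction of p-labeled instances among all
# training instances having that value (names are ours).

from collections import Counter

def value_scores(values, labels):
    """Map each categorical value v to n_p(v) / (n_p(v) + n_n(v))."""
    n_p = Counter(v for v, y in zip(values, labels) if y == 'p')
    n_all = Counter(values)
    return {v: n_p[v] / n_all[v] for v in n_all}

scores = value_scores(['a', 'a', 'b', 'b', 'b'], ['p', 'n', 'p', 'p', 'n'])
# scores['a'] = 1/2, scores['b'] = 2/3: 'b' ranks higher than 'a'
```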

**4.1.1 The Effect of the Choice of the Class Labeling on the AUC**

To calculate the *TPR* and *FPR* values, one of the classes should be labeled as **p** and the other class as **n**, but one can question the effect this choice has on the AUC value. It is possible to show that the AUC value of a categorical feature is independent of the choice of class labels by using the Wilcoxon-Mann-Whitney statistic.

In (7), the AUC formula based on the Wilcoxon-Mann-Whitney statistic is given. The set P represents the **p**-labeled instances and the set N represents the **n**-labeled instances; r_i is the rank of the i-th instance in the set P and, similarly, r_j is the rank of the j-th instance in the set N

AUC = ( Σ_{i in P} Σ_{j in N} I(r_i > r_j) ) / ( |P| × |N| ) (7)

where I(·) is 1 if its argument is true and 0 otherwise.

The numerator of the AUC formula in (7) counts, for each **p**-labeled instance in the set P, the number of **n**-labeled instances in the set N that are ranked lower. The AUC is then calculated by dividing this sum by the product of the numbers of **p**-labeled and **n**-labeled elements.

It is straightforward to see that the denominator of the AUC formula is independent of the choice of the class labels. Assume that the ranking function given in (6) is used on the data set, and the sets P and N are formed. Let k be the number of **n**-labeled instances whose rank is lower than that of the i-th element of the set P, and let s be the score assigned to this element. When the classes are swapped, the new ranking score is equal to 1 − s. Using this property, all instance scores are subtracted from 1. However, this operation simply reverses the ranking of the instances. So the formula in (7), which calculates the value of AUC using the ranks of the instances, is independent of the choice of class labeling.
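The pairwise form of (7) and the label-swap argument can be checked with a short sketch (ours; ties are counted as half, as in the Wilcoxon-Mann-Whitney statistic):

```python
# Sketch of the pairwise AUC computation in (7), plus a check that
# complementing every score (s -> 1 - s) and swapping the class labels
# leaves the AUC unchanged.

def auc(scores, labels):
    """AUC = fraction of (p, n) pairs in which the p instance scores higher.
    Ties count as half, as in the Wilcoxon-Mann-Whitney statistic."""
    pos = [s for s, y in zip(scores, labels) if y == 'p']
    neg = [s for s, y in zip(scores, labels) if y == 'n']
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

s = [0.9, 0.7, 0.6, 0.2]
y = ['p', 'n', 'p', 'n']
swapped = {'p': 'n', 'n': 'p'}
# swapping labels and reversing the ranking preserves the AUC
assert auc(s, y) == auc([1 - v for v in s], [swapped[c] for c in y])
```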

**4.1.2 An Example Toy Data Set**

Consider a toy training data set with a single categorical feature, given in Table 1. To calculate the AUC value of this feature, assume that the ranking scores of the categorical values are calculated by the function in (6). The version of the data set sorted by the ranking function, together with the score values, is given in Table 2. In this example, the number of **p**-labeled instances is 7 and the number of **n**-labeled instances is 6. The AUC value of this feature, calculated by using (7), is 0.82. When the class labels are swapped (**n** labels are replaced by **p** labels and vice versa), the ranking scores are complemented as well. The sorted version of the swapped toy data set is given in Table 3. Since the relative ranking of the instances does not change, the AUC value of the new ranking is also 0.82.

Table 1. Toy Training Data Set with One Categorical Feature

Table 2. Toy Training Data Set with Score Values

Table 3. Training Data Set with Class Labels Swapped

A trivial risk function for a continuous feature could map each real value observed with the class label **p** to 1 and each real value observed with the class label **n** to 0. This risk function will result in the maximum possible value for the AUC, which is 1.0. However, such a risk function will overfit the training data, and it will be undefined for unseen values of the feature, which are very likely to appear in a query instance. The first requirement for a risk function for a continuous feature is therefore that it must be defined for all possible values of that feature. A straightforward way to satisfy this requirement is to discretize the continuous feature by grouping all consecutive values with the same class label into a single categorical value. The cut-points can be set to the midpoint between feature values of differing class labels. The ranking function can then be defined using the function given in (6) for categorical features. Although this would result in a ranking function that is defined for all values of a continuous feature, it would still suffer from the overfitting problem. To overcome this problem, the RIMARC algorithm makes the following assumption.

**Assumption.** *For a continuous feature, either the higher values are indicators of the p class and the lower values are indicators of the n class, or vice versa.*
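The straightforward midpoint discretization described above can be sketched as follows (a simplified illustration; all names are ours):

```python
# Sketch of naive midpoint discretization: sort by feature value and place
# a cut-point at the midpoint between values whose class labels differ.

def naive_cut_points(values, labels):
    """Return midpoints between consecutive values of differing labels."""
    data = sorted(zip(values, labels))     # ascending feature values
    cuts = []
    for (v1, y1), (v2, y2) in zip(data, data[1:]):
        if y1 != y2:                       # class changes between v1 and v2
            cuts.append((v1 + v2) / 2)     # midpoint becomes a cut-point
    return cuts

assert naive_cut_points([1, 2, 3, 4], ['n', 'n', 'p', 'p']) == [2.5]
```

As the text notes, this construction is defined everywhere but still overfits: every change of class label in the training sequence produces its own cut-point.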

**4.2.1 The MAD2C Method**

As explained above, the RIMARC algorithm requires all features to be categorical. Therefore, the continuous features in a data set need to be converted into categorical ones through discretization. The aim of a discretization method is to find the proper cut-points to categorize a given continuous feature.

The MAD algorithm given in [35] is defined for multiclass data sets. It is designed to maximize the AUC value by checking the ranking quality of the values of a continuous feature. A special version of the MAD algorithm defined for two-class problems, called MAD2C, is used in RIMARC.

The MAD2C algorithm first sorts the training instances by the feature value in ascending order. Sorting is essential for all discretization methods to produce intervals that are as homogeneous as possible. After the sorting operation, the feature values are used as hypothetical ranking scores and the ROC graph of the ranking on that feature is constructed. The AUC of the ROC curve indicates the overall ranking quality of the continuous feature. To obtain the maximum AUC value, only the points on the convex hull must be selected. The minimum set of points that form the convex hull is found by eliminating the points that cause concavities on the ROC graph. In each pass, the MAD2C method compares the slopes in the order of the creation of the hypothetical lines, finds the junction points (cut-points) that cause concavities, and eliminates them. This process is repeated until there is no concavity remaining on the graph. The points left on the graph are the cut-points, which will be used to discretize the feature.

It has been proven that the MAD2C method finds the cut-points that yield the maximum AUC. It has also been shown that the cut-points found by MAD2C never separate two consecutive instances of the same class, an important property expected from a discretization method. The MAD2C method is preferred over other discretization methods since empirical evaluations have shown that it has a lower running time on two-class data sets and helps classification algorithms achieve a higher performance, in terms of AUC, than other well-known discretization methods [35].
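The convex-hull idea behind MAD2C can be illustrated with the following simplified sketch. This is not the authors' implementation: for brevity it reports the feature values at the surviving hull vertices as cut-points rather than midpoints, and it assumes binary labels 'p'/'n':

```python
# Simplified sketch of the MAD2C idea: sort by feature value, treat the
# values as hypothetical ranking scores, build the ROC points, and keep
# only the points on the upper convex hull (eliminating concavities); the
# feature values at the surviving interior vertices become cut-points.

def mad2c_cut_points(values, labels):
    data = sorted(zip(values, labels))           # ascending feature values
    P = sum(1 for _, y in data if y == 'p')
    N = len(data) - P
    # Walk from the highest value down, emitting one ROC point per distinct
    # value, and remember the value that produced each point.
    points, tp, fp, i = [(0.0, 0.0, None)], 0, 0, len(data) - 1
    while i >= 0:
        v = data[i][0]
        while i >= 0 and data[i][0] == v:
            tp, fp = tp + (data[i][1] == 'p'), fp + (data[i][1] == 'n')
            i -= 1
        points.append((fp / N, tp / P, v))
    # Eliminate concavities: keep only the upper convex hull of the points.
    hull = []
    for pt in points:
        while len(hull) >= 2:
            (x1, y1, _), (x2, y2, _) = hull[-2], hull[-1]
            # drop hull[-1] if it lies on or below the chord hull[-2] -> pt
            if (x2 - x1) * (pt[1] - y1) >= (pt[0] - x1) * (y2 - y1):
                hull.pop()
            else:
                break
        hull.append(pt)
    # Interior hull vertices correspond to the chosen cut-points.
    return [v for _, _, v in hull[1:-1]]
```

For instance, on values 1, 2, 3, 4 with labels n, n, p, p, the only surviving interior vertex separates the two classes, matching the property that MAD2C never cuts between consecutive instances of the same class.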

**4.2.2 A Toy Data Set Discretization Example**

To visualize the discretization process using the MAD2C method, a toy data set is given in Table 4. After the sorting operation, the ROC points are formed. The ROC graph for the example in Table 4 is given in Fig. 3. In this example, the higher values of feature F1 are indicators of the **p** class. If the lower values of feature F1 were indicators of the **p** class, then the **p** and **n** class labels would be swapped and the ROC graph below the diagonal would be obtained. Note that these two ROC graphs are symmetric about the diagonal.

Table 4. A Toy Data Set for Visualizing MAD2C

The first pass of the MAD2C algorithm is shown in Fig. 4. All points below or on the diagonal are ignored since they have no positive effect on the maximization of AUC. Then the points causing concavities are eliminated. MAD2C converged to the convex hull in one pass for this example, as shown in Fig. 4. The set of points left on the graph are reported as the discretization cut-points.

The weight w_f of a feature f is defined in (8) as a linear function of AUC(f), the AUC computed on that feature

(8)

The ranking score of a query instance q is then the weighted average of its per-feature scores

S(q) = ( Σ_f w_f × s_f(v_f) ) / ( Σ_f w_f ) (9)

The authors of [11] showed that the problem of finding the ordering that agrees best with a learned preference function is NP-complete. As a solution, this weighting mechanism is used as a simple heuristic to extend the maximization over the whole feature set.

In (9), s_f(v_f) is the estimated probability that an instance is **p**-labeled, given that its value for the feature f is v_f, and w_f is the weight of the feature f, computed by (8). The sums run over the features whose values are known for q. Finally, to obtain the weighted average, the sum of all weighted score values is divided by the sum of the weights of all used (known) features. That is, S(q) is the weighted estimate of the probability that the instance q has the label **p**.

The time complexity of the MAD2C algorithm, excluding the sorting time, is given in [35]. After discretizing the numerical features, the time complexity of training in the RIMARC algorithm is O(m × k), where m is the number of features and k is the average number of categorical values per feature. That is, the training time complexity of RIMARC is bounded by that of the MAD2C algorithm. The time complexity of the querying step is simply O(m).
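Putting the pieces together, the query step of Algorithm 1 can be sketched as below. The concrete weight formula w = 2 × (AUC − 0.5) is our assumption of a linear weighting; the paper states only that the weight is a linear function of the feature's AUC. All names are illustrative:

```python
# Sketch of the query step: per-feature value scores are combined as a
# weighted average; the weight is a linear function of each feature's AUC
# (w = 2 * (AUC - 0.5) is an assumption on our part), and missing or
# unseen feature values are simply skipped, as in the paper.

def rank_score(query, value_scores, feature_auc):
    """query: {feature: value or None}; value_scores: {feature: {value: s}};
    feature_auc: {feature: AUC of that feature on the training set}."""
    num = den = 0.0
    for f, v in query.items():
        if v is None or v not in value_scores[f]:
            continue                       # ignore missing / unseen values
        w = 2.0 * (feature_auc[f] - 0.5)   # assumed linear weight from AUC
        num += w * value_scores[f][v]
        den += w
    return num / den if den > 0 else 0.5   # no usable feature: neutral score

score = rank_score(
    {'f1': 'a', 'f2': None},
    {'f1': {'a': 0.8}, 'f2': {'x': 0.1}},
    {'f1': 0.9, 'f2': 0.7},
)
# only f1 is usable here, so the score equals its value score, 0.8
```

The loop touches each feature once, matching the O(m) querying cost stated above.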

The real-life data sets used in the experiments are taken from the UCI machine learning repository [25]. We chose 10 data sets with two class labels. The properties of the data sets are given in Table 5.

Table 5. Real Life Data Sets for Evaluation

The machine learning algorithms used for comparison are taken from the Weka package [28]. Since the AUC values are used as the measure of predictive performance, only the classification methods that provide prediction probability estimates are employed in the experiments. However, the SVM algorithms do not provide these probabilities directly. Therefore, the LibSVM library [9] is used for the SVM algorithm, since it provides probability estimates using the second approach proposed by Wu et al. [50]. To compare RIMARC with ranking algorithms that try to maximize the AUC directly, SVM-Perf is also chosen, since its source code is available from the author [34].

It has been shown that maximizing accuracy does not necessarily maximize the AUC [13], [29]. Therefore, it is important to show that RIMARC can outperform accuracy-maximizing algorithms in a statistically significant way, as well as the ranking algorithms based on classifiers.

The statistical significance of the differences is assessed with the Wilcoxon signed-rank test [48]. This method is chosen since it is nonparametric and does not assume a normal distribution, as the paired t-test does. According to the Wilcoxon signed-rank test at a 95 percent confidence level (the same level will be used for the other statistical tests), RIMARC statistically significantly outperforms 17 of the 27 machine learning algorithms. These algorithms include naïve Bayes, decision trees (PART, C4.5), SVM with an RBF kernel, and SVM-Perf with an RBF kernel. The RIMARC algorithm outperformed the other 10 algorithms as well, but the differences between the averages for these algorithms are not statistically significant.

Table 6. Predictive Performance Comparison

Discretization [19] is applied to the naïve Bayes method to improve its predictive performance. Since Dougherty et al. [16] showed that discretization improves the predictive performance of naïve Bayes, an improvement was expected. However, as Table 6 shows, the discretization did not improve the performance of naïve Bayes at all. The C parameter of the J48 algorithm (decision tree) is optimized by using inner cross-validation, with the C value varied from 0.1 up to 0.5 in 10 steps.

It has been pointed out in [32] that most of the data sets in the UCI repository are such that, for classification, their attributes can be considered independently of each other, which explains the success of the RIMARC algorithm on these data sets. A similar observation was also made by Güvenir and Şirin [27].

Table 7. Running Time Performance Comparison

The running time differences are also tested for statistical significance [49]. The methods outperformed by RIMARC are indicated by the ++ symbol in Table 7. Seven algorithms significantly outperformed RIMARC; these algorithms are marked with a separate symbol. The differences between the other five methods in the table and RIMARC are not significant. Note that all of the algorithms that significantly outperformed the RIMARC algorithm in terms of running time are outperformed by RIMARC in terms of the AUC metric.

This task has been studied in the literature as the *bipartite ranking problem*, which refers to the problem of learning a ranking function from a training set of examples with binary labels [2], [10], [26]. Agarwal and Roth [2] studied the learnability of bipartite ranking functions and showed that learning linear ranking functions over Boolean domains is NP-hard. In the bipartite ranking problem, given a training set of instances labeled either positive or negative, one wants to learn a real-valued ranking function that, for an unseen case, provides a measure of closeness to the positive (or negative) class. For example, in a medical domain, a surgeon may be concerned with estimating the risk of a patient who is planned to undergo a serious operation. A successful ranking (or scoring) function is expected to return a high value if the operation carries a high risk for that patient. Specific ranking functions have been developed for particular domains, such as information retrieval [18], [47], finance [6], medicine [12], [14], [44], fraud detection [20], and insurance [17]. Some of these methods depend on statistical models, while others are based on machine learning algorithms. A binary classification algorithm that returns a confidence factor associated with the class label can be used for bipartite ranking, where the confidence factor associated with a positive label (or the complement of that associated with a negative label) can be taken as the value of the ranking function.

The AUC has been widely adopted as a measure of ranking quality [5], [22]. It has been shown that the AUC represents the probability that a randomly chosen positive instance is correctly assigned a higher rank than a randomly selected negative instance. Further, this probability of correct ranking is equal to the value estimated by the nonparametric Wilcoxon statistic [30]. Also, the AUC has important properties such as insensitivity to the class distribution and cost distributions [5], [22], [33]. Agarwal et al. [1] showed what kinds of classification algorithms can be used for ranking problems and proved theorems about the generalization properties of the AUC.

Several algorithms have been proposed to maximize the AUC directly [31], [39], [51]. For example, Ataman et al. [3] proposed a ranking algorithm that maximizes the AUC with linear programming. Brefeld and Scheffer [7] presented an AUC-maximizing support vector machine. Rakotomamonjy [43] proposed a quadratic programming-based algorithm for AUC maximization and showed that, under certain conditions, 2-norm soft margin support vector machines can also maximize the AUC. Toh et al. [46] designed an algorithm to optimize the ROC performance directly according to the fusion classifier. Ferri et al. [23] proposed a method to locally optimize the AUC in decision tree learning, and Cortes and Mohri [13] proposed boosted decision stumps. To maximize the AUC in rule learning, several algorithms have been proposed [4], [21], [40]. A nonparametric linear classifier based on the local maximization of the AUC was presented by Marrocco et al. [38]. An ROC-based genetic learning algorithm was proposed by Sebag et al. [44]. Marrocco et al. [37] used linear combinations of dichotomizers for the same purpose. Freund et al. [26] gave a boosting algorithm combining multiple rankings, and Cortes and Mohri [13] showed that this approach also aims to maximize the AUC. A method by Tax et al. [45] that weighs features linearly by optimizing the AUC has been applied to the detection of interstitial lung disease. Joachims [34] proposed a binary classification algorithm using SVM that can maximize the AUC. Ling and Zhang [36] compared AUC-based tree-augmented naïve Bayes (TAN) and error-based TAN algorithms; the AUC-based algorithms were shown to produce more accurate rankings. More recently, Calders and Jaroszewicz [8] suggested a polynomial approximation of the AUC in order to optimize it efficiently. Linear combinations of classifiers have also been used to maximize the AUC in biometric score fusion [46]. Han and Zhao [29] proposed a linear classifier based on active learning that maximizes the AUC.

RIMARC learns a score for each categorical feature by estimating the probability of the **p** class for each value of that feature. The MAD2C algorithm used by RIMARC discretizes continuous features in a way that yields the maximum AUC as well. The RIMARC algorithm uses the AUC values of the features as their weights in computing the ranking function. With this simple heuristic, the weighted average of all feature value scores is computed to achieve a high AUC over the whole feature set. Since the RIMARC algorithm uses all available feature values and ignores the missing ones, it is robust to missing feature values.

*The authors are with the Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey.*

*E-mail: guvenir@cs.bilkent.edu.tr, kurtcephe@gmail.com.*

*Manuscript received 26 Feb. 2012; revised 15 Oct. 2012; accepted 17 Oct. 2012; published online 23 Oct. 2012.*

*Recommended for acceptance by B.C. Ooi.*

*For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2012-02-0128.*

*Digital Object Identifier no. 10.1109/TKDE.2012.214.*


**H. Altay Güvenir**received the BS and MS degrees in electronics and communications engineering from Istanbul Technical University, in 1979 and 1981, respectively, and the PhD degree in computer engineering and science from Case Western Reserve University in 1987. He joined the Department of Computer Engineering at Bilkent University in 1987. He has been a professor and serving as the chairman of the Department of Computer Engineering since 2001. His research interests include artificial intelligence, machine learning, data mining, and intelligent data analysis. He is a member of the IEEE and the ACM.

**Murat Kurtcephe**received his first MSc degree in computer science from Bilkent University, Ankara, Turkey, where he worked on machine learning and data mining, especially on discretization and risk estimation methods, with Prof. H. Altay Güvenir. He received his second MSc degree from the Computer Science Department at Case Western Reserve University, where he worked with Professor Meral Ozsoyoglu on query visualization and pedigree data querying. Currently, he is working at NorthPoint Solutions, New York, focusing on regulatory compliance reporting applications.
