# Ranking Instances by Maximizing the Area under ROC Curve

H. Altay Güvenir, IEEE
Murat Kurtcephe

Pages: pp. 2356-2366

Abstract—In recent years, the problem of learning a real-valued function that induces a ranking over an instance space has gained importance in machine learning literature. Here, we propose a supervised algorithm that learns a ranking function, called ranking instances by maximizing the area under the ROC curve (RIMARC). Since the area under the ROC curve (AUC) is a widely accepted performance measure for evaluating the quality of ranking, the algorithm aims to maximize the AUC value directly. For a single categorical feature, we show the necessary and sufficient condition that any ranking function must satisfy to achieve the maximum AUC. We also sketch a method to discretize a continuous feature in a way to reach the maximum AUC as well. RIMARC uses a heuristic to extend this maximization to all features of a data set. The ranking function learned by the RIMARC algorithm is in a human-readable form; therefore, it provides valuable information to domain experts for decision making. Performance of RIMARC is evaluated on many real-life data sets by using different state-of-the-art algorithms. Evaluations of the AUC metric show that RIMARC achieves significantly better performance compared to other similar methods.

Keywords—Ranking; data mining; machine learning; decision support

## Introduction

In this paper, we propose a binary classification methodology that ranks instances based upon how likely they are to have a positive label. Our method is based on receiver operating characteristic (ROC) analysis, and attempts to maximize the area under the ROC curve (AUC); hence, the algorithm is called ranking instances by maximizing the area under ROC curve (RIMARC). The RIMARC algorithm learns a ranking function which is a linear combination of nonlinear score functions constructed for each feature separately. Each of these nonlinear score functions aims to maximize the AUC by considering only the corresponding feature in ranking. All continuous features are first discretized into categorical ones in a way that optimizes the AUC. Given a single categorical feature, it is possible to derive a scoring function that achieves the maximum AUC. We show the necessary and sufficient condition that such a scoring function for a single feature has to satisfy for achieving the maximum AUC. Computing the score function for a categorical feature requires only one pass over the training data set. Missing feature values are simply ignored. The AUC value, obtained on the training data, for a single feature, reflects the effect of that feature in the correct ranking. Using these AUC values as weights, the RIMARC algorithm combines these score functions, learned for each feature, into a single ranking function.

The main characteristics of the RIMARC algorithm can be summarized as follows: It achieves comparably high AUC values. Its time complexity for both learning and applying the ranking function is relatively low. Being a nonparametric method, it does not require tuning of parameters to achieve the best performance. It is robust to missing feature values. Finally, the ranking function learned is in a human readable form that can be easily interpreted by domain experts, listing the effects (weight) of features and how their particular values affect the ranking. The RIMARC algorithm is simple and easy to implement. In cases where a data set is collected from experiments for research purposes, the researchers may be more interested in the effects of the features and their particular values on ranking than the particular ranking function.

In the next section, the ranking problem is revisited. Section 3 covers ROC, AUC and research on AUC maximization. In Section 4, the RIMARC method and implementation details are given. Section 5 presents the empirical evaluation of RIMARC on real-world data sets. Section 6 discusses the related work. Finally, Section 7 concludes with some suggestions for future work.

## Ranking

The ranking problem can be viewed as a binary classification problem with additional ordinal information. In the binary classification problem, the learner is given a finite sequence of labeled training examples $z = ((x_{1}, y_{1}),\ldots , (x_{n}, y_{n}))$ , where the $x_{i}$ are instances in some instance space $X$ and the $y_{i}$ are labels in $Y = \{\bf p, n\}$ , and the goal is to learn a binary-valued function $h : X \rightarrow Y$ that accurately predicts the labels of future instances.

The problem of finding a function that ranks positive instances higher than the negative ones is referred to as the bipartite ranking problem. Here, a training data set $D$ , from an instance space $X$ , is given, where the instances come from a set of two categories, positive and negative, represented as $\{\bf p, n\}$ . Using $D$ , the goal is to learn a ranking function $r: X \rightarrow \mathbb{R}$ that ranks future positive instances higher than negative ones. In other words, the function $r$ is expected to assign higher values to positive instances than to negative ones. Then, the instances can be ordered using the values provided by the ranking function.

## Introduction to ROC Analysis

The ROC graph is a tool that can be used to visualize, organize, and select classifiers based on their performance [ 22]. It has become a popular performance measure in the machine learning community after it was realized that accuracy is often a poor metric to evaluate classifier performance [ 33], [ 41], [ 42].

The literature on ROC is more established to deal with binary classification problems than multiclass ones. At the end of the classification phase, some classifiers simply map each instance to a class label (discrete output). Some other classifiers, such as naïve Bayes or neural networks are able to estimate the probability of an instance belonging to a specific class (continuous valued output).

A classifier that produces a discrete output is represented by only one point in the ROC space, since only one confusion matrix is produced from its classification output. Continuous-output-producing classifiers can have more than one confusion matrix by applying different thresholds to predict class membership. All instances with a score greater than the threshold are predicted to be in the p class and all others are predicted to be in the n class. Therefore, for each threshold value, a separate confusion matrix is obtained. The number of confusion matrices is equal to the number of ROC points on an ROC graph. With the method proposed by Domingos [ 15], it is possible to obtain ROC curves even for algorithms that are unable to produce scores.

Let a set of instances, labeled as p or n, be ranked by some scoring function. Given a threshold value $\tau$ , instances whose score is below $\tau$ are predicted as n, and those with score higher than $\tau$ are predicted as p. For a given threshold value, TP is equal to the number of positive instances that have been classified correctly and FP is equal to the number of negative instances that have been misclassified.

The ROC graph can be plotted as the fraction of true positives out of the positives (TPR = true positive rate) versus the fraction of false positives out of the negatives (FPR = false positive rate). The values of TPR and FPR are calculated by using (1). In this equation, $N$ is the total number of negative instances and $P$ is the total number of positive instances:

$TPR = {TP \over P} ,\quad FPR = {{FP} \over {N}}.$

(1)

For each distinct threshold value, a distinct (FPR, TPR) pair is computed. The ROC space is a 2D space with a range of [0, 1] on both axes. In ROC space the vertical axis represents the true positive rate of a classification output, while the horizontal axis represents the false positive rate. That is, each (FPR, TPR) pair corresponds to a point on the ROC space. In a data set with $s$ distinct classifier scores, there are $s+1$ ROC points, including the trivial (0, 0) and (1, 1) points.
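The construction of ROC points from classifier scores can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `roc_points` and the toy scores and labels are hypothetical and do not come from the paper. Each step lowers the threshold past one distinct score, so $s$ distinct scores yield $s+1$ points.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points obtained by thresholding at each distinct score.

    Labels use 'p' for positive and 'n' for negative, as in the text.
    """
    P = sum(1 for y in labels if y == "p")
    N = sum(1 for y in labels if y == "n")
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted p
    tp = fp = 0
    # Lower the threshold past one distinct score at a time, highest first.
    for t in sorted(set(scores), reverse=True):
        tp += sum(1 for s, y in zip(scores, labels) if s == t and y == "p")
        fp += sum(1 for s, y in zip(scores, labels) if s == t and y == "n")
        points.append((fp / N, tp / P))
    return points


# Hypothetical toy data: 5 instances, 4 distinct scores.
scores = [0.9, 0.8, 0.8, 0.4, 0.3]
labels = ["p", "p", "n", "n", "p"]
points = roc_points(scores, labels)
```

With four distinct scores, the sketch returns five points, starting at (0, 0) and ending at (1, 1).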

Although ROC graphs are useful for visualizing the performance of a classifier, a scalar value is needed to compare classifiers. Bradley [ 5] proposes the area under the ROC curve as a performance measure.

The ROC graph space is a one-unit square. The highest possible AUC value is 1.0, which represents the perfect ordering. In ROC graphs, a 0.5 AUC value means random guessing has occurred and values below 0.5 are not realistic as they can be negated by changing the decision criteria of the classifier.

The AUC value of a classifier is equal to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [ 22].

In an empirical ROC curve, the AUC is usually estimated by the trapezoidal rule. According to this rule, trapezoids are formed using the observed points as corners, the areas of these trapezoids are computed and then they are added up. Fawcett [ 22] proposed efficient algorithms for generating the ROC points and computing the AUC.

It can be shown that the area under the ROC curve is closely related to the Mann-Whitney U, which tests whether positives are ranked higher than negatives. AUC is also equivalent to the Wilcoxon test of ranks [ 30].
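The agreement between the trapezoidal estimate and the pairwise (Wilcoxon-Mann-Whitney) view of the AUC can be checked numerically. The following is a minimal sketch on hypothetical toy data; the function names `auc_pairwise` and `auc_trapezoid` are illustrative and this is not the efficient algorithm of Fawcett [ 22].

```python
def auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == "p"]
    neg = [s for s, y in zip(scores, labels) if y == "n"]
    total = 0.0
    for sp in pos:
        for sn in neg:
            total += 1.0 if sp > sn else 0.5 if sp == sn else 0.0
    return total / (len(pos) * len(neg))


def auc_trapezoid(scores, labels):
    """Sum of trapezoid areas under the empirical ROC curve."""
    P = labels.count("p")
    N = labels.count("n")
    area, tp, fp = 0.0, 0, 0
    prev_fpr, prev_tpr = 0.0, 0.0
    for t in sorted(set(scores), reverse=True):
        tp += sum(1 for s, y in zip(scores, labels) if s == t and y == "p")
        fp += sum(1 for s, y in zip(scores, labels) if s == t and y == "n")
        fpr, tpr = fp / N, tp / P
        # Trapezoid between the previous ROC point and the current one.
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
    return area


# Hypothetical toy data with one tie between a positive and a negative.
scores = [0.9, 0.8, 0.8, 0.4, 0.3]
labels = ["p", "p", "n", "n", "p"]
a_pairs = auc_pairwise(scores, labels)
a_trap = auc_trapezoid(scores, labels)
```

On this toy data both functions give 3.5/6; the tie is what makes the diagonal trapezoid edge and the 0.5 pair credit coincide.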

## RIMARC

RIMARC is a simple, yet powerful, ranking algorithm designed to maximize the AUC metric directly. The RIMARC algorithm reduces the problem of finding a ranking function for the whole set of features into finding a ranking function for a single categorical feature, and then combines these functions to form one covering all features. We will show that it is possible to determine a ranking function that achieves the maximum possible AUC for a single categorical feature. During the training phase, the RIMARC algorithm first discretizes the continuous features by a method called MAD2C, proposed by Kurtcephe and Güvenir [ 35]. The MAD2C method discretizes a continuous feature in a way that yields a set of categorical values for which the AUC of the new categorical feature is the maximum. At this point, all the features are categorical. Then, the score for a value $v$ of a feature $f$ is assigned to be the probability of a training instance that has the value $v$ for the feature $f$ to have the class label p. During the computation of the score values, instances whose value for the feature $f$ is missing are simply ignored. For each feature, the values are sorted in the increasing order of their score, and the AUC is computed. Finally, the weight of a feature $f$ is computed as $w_f = 2(AUC(f) - 0.5)$ , where $AUC(f)$ is the AUC obtained on the feature $f$ . That is, the weight of a feature is a linear function of its AUC value calculated using the training instances whose value for that feature is known. A higher value of AUC for a feature is an indication of its higher relevance in determining the class label. For example, if the AUC computed for a feature $f$ is 1, then it means that all instances in the training set can be ranked by using only the values of $f$ . Hence, we can expect that new query instances can also be ranked correctly by using $f$ only. The training method of the RIMARC algorithm is given in Algorithm 2.

For a given query, $q$ , the ranking function, $r$ (), of the RIMARC algorithm returns a real value $r(q)$ in the range of [0, 1]. This value $r$ ( $q$ ) is roughly the probability that the instance $q$ has the class label p. It is only a rough estimate of the probability, since it is very likely that no other instance with exactly the same feature values has been observed in the training set. The ranking function of the RIMARC algorithm determines this estimated probability by computing the weighted average of probabilities computed on single features, as shown in Algorithm 1.
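The training and ranking steps described above can be sketched as follows. This is a hedged illustration based only on the prose description, not a reproduction of Algorithms 1 and 2. It assumes all features are already categorical (the MAD2C discretization step is not included), and the names `train`, `rank`, and `pairwise_auc` are hypothetical. Treating an unseen query value like a missing one, and returning 0.5 when no feature applies, are assumptions of this sketch.

```python
from collections import defaultdict


def pairwise_auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == "p"]
    neg = [s for s, y in zip(scores, labels) if y == "n"]
    if not pos or not neg:
        return 0.5  # assumption: an undefined AUC is treated as uninformative
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def train(instances, labels):
    """instances: list of dicts mapping feature -> categorical value (None = missing)."""
    model = {}
    features = {f for x in instances for f in x}
    for f in features:
        # Instances whose value for f is missing are ignored.
        known = [(x[f], y) for x, y in zip(instances, labels) if x.get(f) is not None]
        counts = defaultdict(lambda: [0, 0])  # value -> [P_v, N_v]
        for v, y in known:
            counts[v][0 if y == "p" else 1] += 1
        score = {v: p / (p + n) for v, (p, n) in counts.items()}
        auc = pairwise_auc([score[v] for v, _ in known], [y for _, y in known])
        model[f] = (score, 2.0 * (auc - 0.5))  # (score table, weight w_f)
    return model


def rank(model, q):
    """Weighted average of the per-feature scores of the known values of q."""
    num = den = 0.0
    for f, (score, w) in model.items():
        v = q.get(f)
        if v is None or v not in score:  # missing or unseen value: skip the feature
            continue
        num += w * score[v]
        den += w
    return num / den if den > 0 else 0.5  # assumption: neutral value if nothing applies


# Hypothetical toy data with one missing value.
instances = [
    {"color": "red", "size": "big"},
    {"color": "red", "size": "small"},
    {"color": "blue", "size": "big"},
    {"color": "blue", "size": None},
]
labels = ["p", "p", "n", "n"]
model = train(instances, labels)
r_red_big = rank(model, {"color": "red", "size": "big"})
r_blue = rank(model, {"color": "blue", "size": None})
```

In this toy example the feature `color` separates the classes perfectly and receives weight 1, while `size` has an AUC of 0.75 and receives weight 0.5, so the learned model can be read directly as a table of per-value scores and feature weights.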

In the following sections, we will show how the ranking function and the score values can be defined for a single categorical feature.

### 4.1 Single Categorical Feature Case

A categorical feature has a finite set of choices as its values. Let $V = \{v_{1}, \ldots, v_{k}\}$ be the set of categorical values for a given feature. In that case, the training data set $D$ is a set of instances, each represented by a pair of feature value and class label as $\langle v_{i}, c \rangle$ , where $v_{i} \in V$ and $c \in \{{\bf p,n}\}$ . A ranking function $r: V \rightarrow [0, 1]$ can be defined to rank the values in $V$ . According to this ranking function, a value $v_{j}$ comes after a value $v_i$ if and only if $r(v_i) < r(v_j)$ ; hence, $r$ defines a total ordering on the set $V$ , written as $v_{i} \le v_{j}$ . A pair of consecutive values $v_{i}$ and $v_{i+1}$ defines an ROC point $R_{i}$ on the ROC space. The coordinates of the point $R_{i}$ are $(FPR_{i}, TPR_{i})$ . The instances of $D$ can then be ranked according to the values of $r(v_{i})$ corresponding to their feature values, $v_{i}$ .

Let $D$ be a data set with a single categorical feature whose value set is $V = \{v_{1}, \ldots, v_{k}\}$ , and $r: V \rightarrow [0, 1]$ be the ranking function that orders the values of $V$ , such that $v_{i} \le v_{i+1} \;{\rm if} \;r(v_{i})< r(v_{i+1})$ , for all values of $i$ with $1 \le i < k$ . Since there is only one feature, this function $r$ () ranks the instances directly.

It is interesting to note that, if the values of the ranking function for two consecutive values $v_{i}$ and $v_{i+1}$ are swapped, then the only change in the ROC curve is that the ROC point corresponding to the $v_{i}$ and $v_{i+1}$ values moves to a new location so that the slopes of the line segments adjacent to that ROC point are swapped.

For example, consider a data set with a single categorical feature, given as $D = \{({\rm a},{\bf n}), ({\rm b},{\bf p}), ({\rm b},{\bf n}), ({\rm b},{\bf n}), ({\rm b},{\bf n}), ({\rm c},{\bf p}), ({\rm c},{\bf p}), ({\rm c},{\bf n}), ({\rm c},{\bf n}), ({\rm d},{\bf p}), ({\rm d},{\bf p}), ({\rm d},{\bf n})\}$ , where $V = \{{\rm a, b, c, d}\}$ . If a ranking function $r$ orders the values of $V$ as ${\rm a}\le {\rm c}\le {\rm b}\le {\rm d}$ , with $r({\rm a}) < r({\rm c}) < r({\rm b}) < r({\rm d})$ , the ROC curve shown in Fig. 1a will be obtained. If the ranking function is modified so that values of $r{\rm (b)}$ and $r{\rm (c)}$ are swapped, the ROC curve shown in Fig. 1b will be obtained. A similar technique was used earlier by Flach and Wu [ 24] to create better prediction models for classifiers.

Fig. 1. Effect of swapping the values of the ranking function of two feature values.
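The effect of the swap on the AUC can be checked numerically on the 12-instance example data set. The sketch below uses the pairwise definition of AUC with ties counted as 0.5. The integer scores assigned to the values are hypothetical; only their order matters.

```python
# The example data set D: (feature value, class label) pairs.
D = [("a", "n"),
     ("b", "p"), ("b", "n"), ("b", "n"), ("b", "n"),
     ("c", "p"), ("c", "p"), ("c", "n"), ("c", "n"),
     ("d", "p"), ("d", "p"), ("d", "n")]


def auc(r, data):
    """Pairwise AUC of the ranking function r (dict value -> score); ties count 0.5."""
    pos = [r[v] for v, c in data if c == "p"]
    neg = [r[v] for v, c in data if c == "n"]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


r_before = {"a": 1, "c": 2, "b": 3, "d": 4}  # order a <= c <= b <= d
r_after = {"a": 1, "b": 2, "c": 3, "d": 4}   # r(b) and r(c) swapped

auc_before = auc(r_before, D)  # 21.5 / 35, about 0.614
auc_after = auc(r_after, D)    # 25.5 / 35, about 0.729
```

Swapping the scores of b and c raises the AUC from 21.5/35 to 25.5/35 on this data, because c has a higher proportion of p-labeled instances than b.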

This property of the ROC spaces helps to remove the concavities in an ROC curve, resulting in a larger AUC. To obtain the maximum AUC value, the ROC curve has to be convex. Hence, we will show how to construct the ranking function so that the resulting ROC curve is guaranteed to be convex.

Note that the slope of the line segment between two consecutive ROC points $R_{i}$ and $R_{i+1}$ is

$s_i = {TPR_i - TPR_{i + 1} \over FPR_i - FPR_{i + 1}}.$

(2)

In order for the ROC curve to be convex, the slopes of all line segments connecting consecutive ROC points starting from the trivial ROC point (1, 1) must be nondecreasing, as shown in Fig. 2. Therefore, the condition for a convex ROC curve is

$\mathop{\forall i}_{1 \le i < k} \quad s_{i + 1} \ge \;s_i.$

(3)

Using (2),

$\mathop{\forall i}_{1 \le i < k} \quad {{TPR_{i + 1} - TPR_{i + 2}} \over {FPR_{i + 1} - FPR_{i + 2}}} \ge {{TPR_i - TPR_{i + 1}} \over {FPR_i - FPR_{i + 1}}}.$

By definition, $TPR_i = {{ TP_i }\over {P}}$ , where $TP_{i}$ is the number of true positives obtained when the threshold is placed at $v_{i}$ , that is, the number of p-labeled instances whose value is $v_{i}$ or is ranked after $v_{i}$ . Further, due to the ordering of values, $TP_i = P_i + TP_{i + 1}$ , where $P_{i}$ is the number of p-labeled instances with value $v_{i}$ . Hence,

$TPR_i - TPR_{i + 1} = {{TP_i} \over {P}} - {{TP_{i + 1}}\over {P}} = {{1} \over{P}}\big( {P_i + TP_{i + 1} - TP_{i + 1} } \big) = {{P_i} \over {P}}.$

Similarly, with $N_{i}$ denoting the number of n-labeled instances with value $v_{i}$ ,

$FPR_i - FPR_{i + 1} = {{N_i} \over {N}},$

$TPR_{i + 1} - TPR_{i + 2} = {{P_{i + 1}} \over {P}},$ and

$FPR_{i + 1} - FPR_{i + 2} = {{N_{i + 1}} \over {N}}.$

Therefore, the condition in (3) can be rewritten as

$\mathop{\forall i}_{1 \le i < k} \quad {{P_{i + 1}/P} \over {N_{i + 1} / N}} \ge {{P_i / P} \over {N_i / N}}.$

Finally,

$\mathop{\forall i}_{1 \le i < k} \quad {{P_{i+1}}\over {N_{i+1}}} \ge {{P_i} \over {N_i}}.$

(4)

That is, to obtain a convex ROC curve, the condition in (4) must be satisfied for all ROC points. In other words, to achieve the maximum AUC, the ranking function has to satisfy the following condition:

$\mathop {\forall i }_{1 \le i < k} r(v_{i + 1} ) > r(v_i ) \;{\rm iff}\; {{P_{i + 1}} \over {N_{i + 1} }} \ge {{{P_i }} \over {{N_i }}}.$

(5)

Further, any ranking function that satisfies (5) will result in the same ROC curve and achieve the maximum AUC. For example, the ranking function defined as $r(v_i ) = {{P_i} \over {N_i}}$ will result in a convex ROC curve.

Fig. 2. Relation between the slopes of two consecutive line segments in a convex ROC curve.

It is also important to note that, given a data set with a single categorical feature, there exists exactly one convex ROC curve, and it corresponds to the best ranking.
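For a small value set, this claim can be verified by brute force. The sketch below enumerates all orderings of the four values from the earlier example, using their $(P_i, N_i)$ counts, and compares the best ordering with the one induced by the proportion of p-labeled instances per value. The function name `auc` and the variable names are illustrative.

```python
from itertools import permutations

# (P_i, N_i) counts of the example data set with values a, b, c, d.
counts = {"a": (0, 1), "b": (1, 3), "c": (2, 2), "d": (2, 1)}


def auc(order, counts):
    """AUC when values are ranked in the given increasing order.

    A p-labeled instance beats every n-labeled instance of a lower-ranked value
    and ties (0.5) with the n-labeled instances sharing its own value.
    """
    P = sum(p for p, _ in counts.values())
    N = sum(n for _, n in counts.values())
    wins, neg_below = 0.0, 0
    for v in order:
        p, n = counts[v]
        wins += p * (neg_below + 0.5 * n)
        neg_below += n
    return wins / (P * N)


# Exhaustive search over all 4! = 24 orderings.
best_order = max(permutations(counts), key=lambda o: auc(o, counts))
# Ordering induced by the proportion of p-labeled instances of each value.
by_ratio = tuple(sorted(counts, key=lambda v: counts[v][0] / sum(counts[v])))
```

Here the exhaustive search and the proportion-based ordering agree on a, b, c, d. Exhaustive search is only feasible for tiny value sets; sorting by the score needs no search at all.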

The general assumptions for ranking problems are given below:

$\begin{array}{ll} \displaystyle \mathop{\forall i}_{1 \le i \le k} \; P_i \ge 0, & \displaystyle \mathop{\forall i}_{1 \le i \le k} \; N_i \ge 0, \\ \displaystyle P = \sum_{i = 1}^{k} P_i > 0, & \displaystyle N = \sum_{i = 1}^{k} N_i > 0. \end{array}$

Although the data set is guaranteed to have at least one instance with class label p and one instance with label n, it is possible that for some values of $i$ , $N_{i}$ may be 0. In such cases the ranking function defined as $r(v_i ) = {P_i}/{N_i}$ will have an undefined value. To avoid such problems, the RIMARC algorithm defines the ranking function as

$r(v_i ) = {{{P_i }} \over {{P_i + N_i }}}.$

(6)

Note that, since ${\forall i}_{1 \le i \le k} \;P_i + N_i > 0$ , $r$ ( $v_i$ ) is defined for all values of $i$ . To see that the ranking function defined in (6) satisfies the condition in (5), note that if

$\mathop{\forall i}_{1 \le i < k} \quad {{P_{i + 1}} \over {P_{i + 1} + N_{i + 1}}} \ge {{P_i } \over {P_i + N_i }},$

then

$\mathop {\forall i}_{1 \le i < k} \quad P_{i + 1} \left( {P_i + N_i } \right) \ge P_i \left( {P_{i + 1} + N_{i + 1} } \right),$

which simplifies to $P_{i + 1} N_i \ge P_i N_{i + 1}$ , and hence

$\mathop {\forall i}_{1 \le i < k} \quad {{P_{i + 1}} \over {N_{i + 1}}} \ge {{P_i} \over {N_i}}.$

The ranking function given in (6) has another added benefit, in that it is simply the probability of the p label among all instances with value $v_{i}$ . Such a probability value is easily interpretable by humans.

However, note that the commonly used Laplace estimate, defined as ${{P_i + 1} \over {P_i + N_i + 2}}$ does not satisfy the condition in (5).
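A small counterexample makes this concrete. The counts below are hypothetical: value x has one p-labeled and no n-labeled instance, while value y has ten p-labeled and one n-labeled instance. The score in (6) places x above y, whereas the Laplace estimate reverses the order and yields a lower AUC.

```python
# Hypothetical counts: value -> (P_i, N_i).
counts = {"x": (1, 0), "y": (10, 1)}


def relative_frequency(p, n):
    return p / (p + n)            # the score defined in (6)


def laplace(p, n):
    return (p + 1) / (p + n + 2)  # the Laplace estimate


def auc(score, counts):
    """Pairwise AUC over all (positive, negative) pairs; ties count 0.5."""
    P = sum(p for p, _ in counts.values())
    N = sum(n for _, n in counts.values())
    wins = 0.0
    for pu, nu in counts.values():        # value of the positive instance
        for pv, nv in counts.values():    # value of the negative instance
            su, sv = score(pu, nu), score(pv, nv)
            wins += pu * nv * (1.0 if su > sv else 0.5 if su == sv else 0.0)
    return wins / (P * N)


auc_freq = auc(relative_frequency, counts)  # 6 / 11
auc_laplace = auc(laplace, counts)          # 5 / 11
```

The Laplace estimate pulls the score of the rarely observed value x toward 0.5, which is enough to push it below y even though x has the larger ratio $P_i / N_i$.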

#### 4.1.1 The Effect of the Choice of the Class Labeling on the AUC

To calculate the $P$ and $N$ values, one of the classes should be labeled as p and the other class as n, but one can question the effect this choice has on the AUC value. It is possible to show that the AUC value of a categorical feature is independent of the choice of class labels by using the Wilcoxon-Mann-Whitney statistic.

In (7), the AUC formula based on the Wilcoxon-Mann-Whitney statistic is given. The set $D_{p}$ represents the p-labeled instances and $D_{n}$ represents the n-labeled instances. $D_{pi}$ is the ranking of the $i$ th instance in the $D_{p}$ set; similarly, $D_{nj}$ is the ranking of the $j$ th instance in the $D_{n}$ set:

$AUC = {{\sum_{i = 1}^{P} \sum_{j = 1}^{N} f(D_{pi}, D_{nj})} \over {PN}}, \qquad f(D_{pi}, D_{nj}) = \begin{cases} 1 & {\rm if}\; D_{pi} > D_{nj}, \\ 0 & {\rm if}\; D_{pi} < D_{nj}, \\ 0.5 & {\rm if}\; D_{pi} = D_{nj}. \end{cases}$

(7)

The numerator of the AUC formula in (7) counts the pairs of one p-labeled and one n-labeled instance in which the element of the $D_{p}$ set is ranked higher than the element of the $D_{n}$ set; a tied pair contributes 0.5. Then, AUC is calculated by dividing this sum by the product of the numbers of p-labeled and n-labeled instances.

It is straightforward to see that the denominator of the AUC formula is independent of the choice of the class labels. Assume that the ranking function given in (6) is used on the data set $D$ , and $D_{p}$ and $D_{n}$ sets are formed. Let $n_{i}$ be the number of n-labeled instances whose ranking is lower than the $i$ th element of the $D_{p}$ set and let $r_{i}$ be the score assigned to this element. When the classes are swapped, the new ranking score $r^{\prime}_{i}$ is equal to $1- r_{i}$ . That is, every instance score is subtracted from 1, which simply reverses the ranking of the instances. Because the p and n labels are exchanged at the same time, each pair that was ordered correctly before the swap is still ordered correctly after it, and each tie remains a tie. So the formula in (7), which calculates the value of AUC, using the ranking of the instances, is independent of the choice of class labeling.
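This independence can be checked on the 12-instance example data set from Section 4.1. The sketch below computes the score in (6) and the resulting pairwise AUC twice, once treating p as the positive class and once treating n as the positive class. The function name `auc_with_positive` is hypothetical.

```python
# The example data set D: (feature value, class label) pairs.
D = [("a", "n"),
     ("b", "p"), ("b", "n"), ("b", "n"), ("b", "n"),
     ("c", "p"), ("c", "p"), ("c", "n"), ("c", "n"),
     ("d", "p"), ("d", "p"), ("d", "n")]


def auc_with_positive(data, positive):
    """AUC of the score in (6) when `positive` is the class treated as p."""
    score = {}
    for v in {u for u, _ in data}:
        classes = [c for u, c in data if u == v]
        # Proportion of instances with value v that carry the positive label.
        score[v] = sum(1 for c in classes if c == positive) / len(classes)
    pos = [score[v] for v, c in data if c == positive]
    neg = [score[v] for v, c in data if c != positive]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


auc_p = auc_with_positive(D, "p")
auc_n = auc_with_positive(D, "n")
```

Both calls return 25.5/35 on this data: swapping the labels replaces each score $r$ with $1 - r$ and exchanges the two instance sets, so the same pairs are counted as correctly ordered.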

## Related Work

The problem of learning a real-valued function that induces a ranking or ordering over an instance space has gained importance in machine learning literature. Information retrieval, credit-risk screening or estimation of risks associated with a surgery are some examples of the application domains. In this paper, we consider the ranking problem with binary classification data. It is known as the bipartite ranking problem, which refers to the problem of learning a ranking function from a training set of examples with binary labels [ 2], [ 10], [ 26]. Agarwal and Roth [ 2] studied the learnability of bipartite ranking functions and showed that learning linear ranking functions over Boolean domains is NP-hard. In the bipartite ranking problem, given a training set of instances with labels either positive or negative, one wants to learn a real-valued ranking function that can be used for an unseen case to associate a measure of being close to positive (or negative) class. For example, in a medical domain, a surgeon may be concerned with estimating the risk of a patient who is planned to undergo a serious operation. A successful ranking (or scoring) function is expected to return a high value if the operation carries high risks for that patient. Specific ranking functions have been developed for particular domains, such as information retrieval [ 18], [ 47], finance [ 6], medicine [ 12], [ 14], [ 44], fraud detection [ 20], and insurance [ 17]. Some of these methods are dependent on statistical models while some are based on machine learning algorithms. A binary classification algorithm that returns a confidence factor associated with the class label can be used for bipartite ranking, where the confidence factor associated with a positive label (or the complement associated with a negative label) can be taken as the value of the ranking function.

The area under the receiver operating characteristic curve (AUC) is a widely accepted performance measure for evaluating the quality of a ranking function [ 5], [ 22]. It has been shown that the AUC represents the probability that a randomly chosen positive instance is correctly assigned a higher rank value than a randomly selected negative instance. Further, this probability of correct ranking is equal to the value estimated by the nonparametric Wilcoxon statistic [ 30]. Also, AUC has important features such as insensitivity to class distribution and cost distributions [ 5], [ 22], [ 33]. Agarwal et al. [ 1] showed what kind of classification algorithms can be used for ranking problems and proved theorems about generalization properties of AUC.

Some approximation methods aiming at maximizing the global AUC value directly have been proposed by researchers [ 31], [ 39], [ 51]. For example, Ataman et al. [ 3] proposed a ranking algorithm by maximizing AUC with linear programming. Brefeld and Scheffer [ 7] presented an AUC maximizing support vector machine. Rakotomamonjy [ 43] proposed a quadratic programming-based algorithm for AUC maximization and showed that under certain conditions 2-norm soft margin support vector machines can also maximize AUC. Toh et al. [ 46] designed an algorithm to optimize the ROC performance directly according to the fusion classifier. Ferri et al. [ 23] proposed a method to locally optimize AUC in decision tree learning, and Cortes and Mohri [ 13] proposed boosted decision stumps. To maximize AUC in rule learning, several algorithms have been proposed [ 4], [ 21], [ 40]. A nonparametric linear classifier based on the local maximization of AUC was presented by Marrocco et al. [ 38]. A ROC-based genetic learning algorithm has been proposed by Sebag et al. [ 44]. Marrocco et al. [ 37] used linear combinations of dichotomizers for the same purpose. Freund et al. [ 26] gave a boosting algorithm combining multiple rankings. Cortes and Mohri [ 13] showed that this approach also aims to maximize AUC. A method by Tax et al. [ 45] that weighs features linearly by optimizing AUC has been applied to the detection of interstitial lung disease. Ataman et al. [ 3] advocated an AUC-maximizing algorithm with linear programming. Joachims [ 34] proposed a binary classification algorithm by using SVM that can maximize AUC. Ling and Zhang [ 36] compared AUC-based tree-augmented naïve Bayes (TAN) and error-based TAN algorithms; the AUC-based algorithms are shown to produce more accurate rankings. More recently, Calders and Jaroszewicz [ 8] suggested a polynomial approximation of AUC to optimize it efficiently. 
Linear combinations of classifiers are also used to maximize AUC in biometric scores fusion [ 46]. Han and Zhao [ 29] proposed a linear classifier based on active learning that maximizes AUC.

## Conclusions and Future Work

In this paper, we presented a supervised algorithm for learning a ranking function, called RIMARC.

We have shown that for a categorical feature, there is only one ordering that gives the maximum AUC. Then, we showed the necessary and sufficient condition that a ranking function for a single categorical feature has to satisfy to achieve this ordering. As a result, we proposed a ranking function that achieves the maximum possible AUC value on a single categorical feature. This ranking function is based on the probability of the p class for each value of that feature. The MAD2C algorithm used by RIMARC discretizes continuous features in a way that yields the maximum AUC, as well. The RIMARC algorithm uses the AUC values of features as their weights in computing the ranking function. With this simple heuristic, we computed the weighted average of all feature value scores to achieve a high AUC over the whole feature set. Since the RIMARC algorithm uses all available feature values and ignores the missing ones, it is robust to missing feature values.

We presented the characteristics of the ranking function learned by the RIMARC algorithm and how it can be interpreted. The ranking function is in a human readable form that can be easily interpreted by domain experts. The feature weights learned help the experts to determine how the features affect the ranking.

We compared RIMARC with 27 different algorithms. According to our empirical evaluations, RIMARC significantly outperformed 17 algorithms on an AUC basis and 13 algorithms on a time basis. It also outperformed all algorithms on the average AUC and 16 of them on an average running time basis.

It is also worth noting that the RIMARC algorithm is a nonparametric machine learning algorithm. As such, it does not have any parameters that need to be tuned to achieve high performance on a given data set; hence, it can be used by domain experts who are not experienced in tuning machine learning algorithms.

To improve the performance of RIMARC, instead of using the weighted average, other approaches can be investigated. Another possible direction for future work would be to experiment with methods that ensemble RIMARC with other ranking algorithms.

## References

• 1. S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth, “Generalization Bounds for the Area under the ROC Curve,” J. Machine Learning Research, vol. 6, pp. 393-425, 2005.
• 2. S. Agarwal, and D. Roth, “Learnability of Bipartite Ranking Functions,” Proc. 18th Ann. Conf. Learning Theory, 2005.
• 3. K. Ataman, W.N. Street, and Y. Zhang, “Learning to Rank by Maximizing AUC with Linear Programming,” Proc. IEEE Int'l Joint Conf. Neural Networks (IJCNN), pp. 123-129, 2006.
• 4. H. Boström, “Maximizing the Area under the ROC Curve Using Incremental Reduced Error Pruning,” Proc. Int'l Conf. Machine Learning Workshop (ICML '05), 2005.
• 5. A.P. Bradley, “The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms,” Pattern Recognition, vol. 30, no. 7, pp. 1145-1159, 1997.
• 6. B.O. Bradley, and M.S. Taqqu, “Handbook of Heavy-Tailed Distributions in Finance,” Financial Risk and Heavy Tails, S.T. Rachev, ed., pp. 35-103, Elsevier, 2003.
• 7. U. Brefeld, and T. Scheffer, “AUC Maximizing Support Vector Learning,” Proc. ICML Workshop ROC Analysis in Machine Learning, 2005.
• 8. T. Calders, and S. Jaroszewicz, “Efficient AUC Optimization for Classification,” Proc. 11th European Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD '07), pp. 42-53, 2007.
• 9. C.C. Chang, and C.C. Lin, “LIBSVM: A Library for Support Vector Machines,” http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
• 10. S. Clémençon, G. Lugosi, and N. Vayatis, “Ranking and Scoring Using Empirical Risk Minimization,” Proc. 18th Ann. Conf. Learning Theory (COLT '05), pp. 1-15, 2005.
• 11. W.W. Cohen, R.E. Schapire, and Y. Singer, “Learning to Order Things,” J. Artificial Intelligence Research, vol. 10, pp. 243-270, 1998.
• 12. R.M. Conroy, K. Pyörälä, and A.P. Fitzgerald, “Estimation of Ten-Year Risk of Fatal Cardiovascular Disease in Europe: The SCORE Project,” European Heart J., vol. 11, pp. 987-1003, 2003.
• 13. C. Cortes, and M. Mohri, “AUC Optimization versus Error Rate Minimization,” Proc. Conf. Neural Information Processing Systems (NIPS '03), vol. 16, pp. 313-320, 2003.
• 14. R.B. D'Agostino, S.V. Ramachandran, and J. Pencina, “General Cardiovascular Risk Profile for Use in Primary Care: The Framingham Heart Study,” Circulation, vol. 17, pp. 743-753, 2008.
• 15. P. Domingos, “MetaCost: A General Method for Making Classifiers Cost-Sensitive,” Proc. Fifth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 155-164, 1999.
• 16. J. Dougherty, R. Kohavi, and M. Sahami, “Supervised and Unsupervised Discretization of Continuous Features,” Proc. 12th Int'l Conf. Machine Learning, pp. 194-202, 1995.
• 17. K. Dowd, and D. Blake, “After VaR: The Theory, Estimation, and Insurance Applications of Quantile-Based Risk Measures,” The J. Risk and Insurance, vol. 73, no. 2, pp. 193-229, 2006.
• 18. W. Fan, M.D. Gordon, and P. Pathak, “Discovery of Context-Specific Ranking Functions for Effective Information Retrieval Using Genetic Programming,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 4, pp. 523-27, Apr. 2004.
• 19. U. Fayyad, and K. Irani, “On the Handling of Continuous-Valued Attributes in Decision Tree Generation,” Machine Learning, vol. 8, pp. 87-102, 1992.
• 20. T. Fawcett, and F. Provost, “Adaptive Fraud Detection,” Data Mining and Knowledge Discovery, vol. 1, pp. 291-316, 1997.
• 21. T. Fawcett, “Using Rule Sets to Maximize ROC Performance,” Proc. IEEE Int'l Conf. Data Mining (ICDM '01), pp. 131-138, 2001.
• 22. T. Fawcett, “An Introduction to ROC Analysis,” Pattern Recognition Letters, vol. 27, pp. 861-874, 2006.
• 23. C. Ferri, P. Flach, and J. Hernandez, “Learning Decision Trees Using the Area under the ROC Curve,” Proc. 19th Int'l Conf. Machine Learning (ICML '02), pp. 139-146, 2002.
• 24. P. Flach, and S. Wu, “Repairing Concavities in ROC Curves,” Proc. UK Workshop Computational Intelligence, pp. 38-44, 2003.
• 25. A. Frank, and A. Asuncion, “UCI Machine Learning Repository,” School of Information and Computer Science, Univ. of California, http://archive.ics.uci.edu/ml, 2010.
• 26. Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer, “An Efficient Boosting Algorithm for Combining Preferences,” J. Machine Learning Research, vol. 4, pp. 933-969, 2003.
• 27. H.A. Güvenir, and İ. Şirin, “Classification by Feature Partitioning,” Machine Learning, vol. 23, no. 1, pp. 47-67, 1996.
• 28. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten, “The WEKA Data Mining Software: An Update,” SIGKDD Explorations, vol. 11, no. 1, pp. 10-18, 2009.
• 29. G. Han, and C. Zhao, “AUC Maximization Linear Classifier Based on Active Learning and Its Application,” Neurocomputing, vol. 73, nos. 7-9, pp. 1272-1280, 2010.
• 30. J.A. Hanley, and B.J. McNeil, “The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve,” Radiology, vol. 143, pp. 29-36, 1982.
• 31. A. Herschtal, and B. Raskutti, “Optimising the Area under the ROC Curve Using Gradient Descent,” Proc. Int'l Conf. Machine Learning, pp. 49-56, 2004.
• 32. R.C. Holte, “Very Simple Classification Rules Perform Well on Most Commonly Used Data Sets,” Machine Learning, vol. 11, pp. 63-91, 1993.
• 33. J. Huang, and C.X. Ling, “Using AUC and Accuracy in Evaluating Learning Algorithms,” IEEE Trans. Knowledge and Data Eng., vol. 17, no. 3, pp. 299-310, Mar. 2005.
• 34. T. Joachims, “A Support Vector Method for Multivariate Performance Measures,” Proc. Int'l Conf. Machine Learning (ICML), 2005.
• 35. M. Kurtcephe, and H.A. Güvenir, “A Discretization Method Based on Maximizing the Area under ROC Curve,” Int'l J. Pattern Recognition and Artificial Intelligence, vol. 27, no. 1, article 1350002, 2013.
• 36. C.X. Ling, and H. Zhang, “Toward Bayesian Classifiers with Accurate Probabilities,” Proc. Sixth Pacific-Asia Conf. Advances in Knowledge Discovery and Data Mining, pp. 123-134, 2002.
• 37. C. Marrocco, M. Molinara, and F. Tortorella, “Exploiting AUC for Optimal Linear Combinations of Dichotomizers,” Pattern Recognition Letters, vol. 27, no. 8, pp. 900-907, 2006.
• 38. C. Marrocco, R.P.W. Duin, and F. Tortorella, “Maximizing the Area under the ROC Curve by Pairwise Feature Combination,” Pattern Recognition, vol. 41, pp. 1961-1974, 2008.
• 39. M.C. Mozer, R. Dodier, M.D. Colagrosso, C. Guerra-Salcedo, and R. Wolniewicz, “Prodding the ROC Curve: Constrained Optimization of Classifier Performance,” Proc. Conf. Advances in Neural Information Processing Systems, vol. 14, pp. 1409-1415, 2002.
• 40. R. Prati, and P. Flach, “Roccer: A ROC Convex Hull Rule Learning Algorithm,” Proc. ECML/PKDD Workshop Advances in Inductive Rule Learning, pp. 144-153, 2004.
• 41. F. Provost, and T. Fawcett, “Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions,” Proc. Third Int'l Conf. Knowledge Discovery and Data Mining, pp. 43-48, 1997.
• 42. F. Provost, T. Fawcett, and R. Kohavi, “The Case against Accuracy Estimation for Comparing Induction Algorithms,” Proc. 15th Int'l Conf. Machine Learning, pp. 445-453, 1998.
• 43. A. Rakotomamonjy, “Optimizing Area under ROC Curve with SVMs,” Proc. Workshop ROC Analysis in Artificial Intelligence, pp. 71-80, 2004.
• 44. M. Sebag, J. Aze, and N. Lucas, “ROC-Based Evolutionary Learning: Application to Medical Data Mining,” Artificial Evolution, vol. 2936, pp. 384-396, 2004.
• 45. D.J.M. Tax, R.P.W. Duin, and Y. Arzhaeva, “Linear Model Combining by Optimizing the Area under the ROC Curve,” Proc. IEEE 18th Int'l Conf. Pattern Recognition, pp. 119-122, 2006.
• 46. K.A. Toh, J. Kim, and S. Lee, “Maximizing Area under ROC Curve for Biometric Scores Fusion,” Pattern Recognition, vol. 41, pp. 3373-3392, 2008.
• 47. F. Wang, and X. Chang, “Cost-Sensitive Support Vector Ranking for Information Retrieval,” J. Convergence Information Technology, vol. 5, no. 10, pp. 109-116, 2010.
• 48. M. Wasikowski, and X. Chen, “Combating the Small Sample Class Imbalance Problem Using Feature Selection,” IEEE Trans. Knowledge and Data Eng., vol. 22, no. 10, pp. 1388-1400, Oct. 2010.
• 49. F. Wilcoxon, “Individual Comparisons by Ranking Methods,” Biometrics, vol. 1, pp. 80-83, 1945.
• 50. T.-F. Wu, C.-J. Lin, and R.C. Weng, “Probability Estimates for Multi-Class Classification by Pairwise Coupling,” J. Machine Learning Research, vol. 5, pp. 975-1005, 2004.
• 51. L. Yan, R. Dodier, M.C. Mozer, and R. Wolniewicz, “Optimizing Classifier Performance via the Wilcoxon-Mann-Whitney Statistics,” Proc. 20th Int'l Conf. Machine Learning, pp. 848-855, 2003.