Algorithm 1. The ranking function.
Algorithm 2. Training in RIMARC.
4.1.1 The Effect of the Choice of the Class Labeling on the AUC
The calculations require one of the classes to be labeled as p and the other as n, which raises the question of how this choice affects the AUC value. It is possible to show, by using the Wilcoxon-Mann-Whitney statistic, that the AUC value of a categorical feature is independent of the choice of class labels.
In (7), the AUC formula based on the Wilcoxon-Mann-Whitney statistic is given. It is expressed in terms of two sets, the p-labeled instances and the n-labeled instances, and of the ranking assigned to each instance of either set.
The numerator of the AUC formula in (7) counts, for each n-labeled instance, the number of p-labeled instances that are ranked higher than it. The AUC is then obtained by dividing this sum by the product of the number of p-labeled instances and the number of n-labeled instances.
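The pair-counting reading of (7) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `auc_wmw`, the convention that a tie contributes one half, and the example scores are assumptions introduced here.

```python
from itertools import product


def auc_wmw(scores, labels, positive="p"):
    """AUC as the fraction of (p, n) pairs in which the p instance is ranked higher."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Numerator: correctly ordered pairs; a tie contributes 0.5 (assumed convention).
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp, sn in product(pos, neg)
    )
    # Denominator: number of p-labeled instances times number of n-labeled instances.
    return wins / (len(pos) * len(neg))


# Hypothetical scores and labels, not taken from the paper's tables.
scores = [0.9, 0.8, 0.8, 0.3, 0.1]
labels = ["p", "p", "n", "p", "n"]
print(auc_wmw(scores, labels))
```

For these hypothetical scores, 4.5 of the 6 (p, n) pairs are correctly ordered, so the script prints 0.75.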
It is straightforward to see that the denominator of the AUC formula is independent of the choice of the class labels. For the numerator, assume that the ranking function given in (6) is applied to the data set and that the sets of p-labeled and n-labeled instances are formed. Consider an instance of one of these sets, the number of n-labeled instances that are ranked lower than it, and the score assigned to it. When the classes are swapped, the new ranking score of the instance is equal to 1 minus its original score, so all instance scores are subtracted from 1. This operation simply reverses the ranking of the instances. Hence the formula in (7), which calculates the AUC value from the ranking of the instances, is independent of the choice of class labeling.
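The argument can be written out as a short derivation. The notation is an assumption introduced here for illustration and may differ from the paper's: s(x) is the score of instance x under (6), D^p and D^n are the p-labeled and n-labeled sets, and a prime marks a quantity after the class labels are swapped.

```latex
% Assumed notation: s(x) = score from (6); D^p, D^n = labeled sets; ' = after the swap.
\begin{align*}
s'(x) &= 1 - s(x), \qquad D'^{p} = D^{n}, \qquad D'^{n} = D^{p} \\
s(x) > s(y) &\iff 1 - s(y) > 1 - s(x) \iff s'(y) > s'(x),
  \qquad x \in D^{p},\ y \in D^{n} \\
\bigl|\{(y, x) \in D'^{p} \times D'^{n} : s'(y) > s'(x)\}\bigr|
  &= \bigl|\{(x, y) \in D^{p} \times D^{n} : s(x) > s(y)\}\bigr| \\
|D'^{p}|\,|D'^{n}| &= |D^{n}|\,|D^{p}|
\end{align*}
```

Both the count of correctly ordered pairs and the product of the set sizes are unchanged by the swap, so their ratio is unchanged as well. Tied pairs remain tied, since s(x) = s(y) holds exactly when s'(x) = s'(y).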
4.1.2 An Example Toy Data Set
Consider a toy training data set with a single categorical feature, given in Table 1. To calculate the AUC value of this particular feature, assume that the ranking scores are calculated by the function in (6). The ranking scores of the categorical values are as follows: , and . The version of the data set sorted by the ranking function is given in Table 2. In this example, the value of is 7 and the value of is 6. The AUC value of this feature, calculated by using (7), is 0.82. When the class labels are swapped (n labels are replaced by p labels and vice versa), the ranking scores are also swapped. The sorted version of the swapped toy data set is given in Table 3. Since the relative ranking of the instances does not change, the AUC value of the new ranking is also 0.82.
Assumption. For a continuous feature, either the higher values are indicators of the p class and the lower values are indicators of the n class, or vice versa.
4.2.1 The MAD2C Method
As explained above, the RIMARC algorithm requires all features to be categorical. Therefore, the continuous features in a data set need to be converted into categorical ones through discretization. The aim of a discretization method is to find the proper cut-points to categorize a given continuous feature.
The MAD algorithm given in [35] is defined for multiclass data sets. It is designed to maximize the AUC value by checking the ranking quality of the values of a continuous feature. A special version of the MAD algorithm, called MAD2C, defined for two-class problems, is used in RIMARC.
The MAD2C algorithm first sorts the training instances in ascending order of the feature value. Sorting is essential for all discretization methods to produce intervals that are as homogeneous as possible. After the sorting operation, the feature values are used as hypothetical ranking scores, and the ROC graph of the ranking on that feature is constructed. The AUC of the ROC curve indicates the overall ranking quality of the continuous feature. To obtain the maximum AUC value, only the points on the convex hull must be selected. The minimum number of points that form the convex hull is found by eliminating the points that cause concavities on the ROC graph. In each pass, the MAD2C method compares the slopes in the order of the creation of the hypothetical lines, finds the junction points (cut-points) that cause concavities, and eliminates them. This process is repeated until no concavity remains on the graph. The points left on the graph are the cut-points, which are used to discretize the feature.
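The idea of keeping only the ROC points on the convex hull can be sketched in Python as follows. This is an illustration under stated assumptions, not the MAD2C implementation from [35]: it assumes that higher feature values indicate the p class, it uses a single stack-based scan instead of the repeated passes described above (both leave the upper convex hull), and it reports each cut-point as the lowest feature value of the interval above it. The function names and the example data are hypothetical.

```python
def roc_points(values, labels, positive="p"):
    """ROC points (FPR, TPR, value) from sweeping a threshold from high to low."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(values, labels), reverse=True)
    points = [(0.0, 0.0, None)]
    tp = fp = i = 0
    while i < len(pairs):
        v = pairs[i][0]
        # Instances sharing a value cannot be separated, so they form one step.
        while i < len(pairs) and pairs[i][0] == v:
            if pairs[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos, v))
    return points


def cross(o, a, b):
    """Cross product of vectors o->a and o->b, using the (FPR, TPR) coordinates."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(points):
    """Keep only the points on the upper convex hull of the ROC graph."""
    hull = []
    for pt in points:
        # Remove the last kept point while it lies on or below the line to pt,
        # that is, while it causes a concavity (or is collinear).
        while len(hull) >= 2 and cross(hull[-2], hull[-1], pt) >= 0:
            hull.pop()
        hull.append(pt)
    return hull


def convex_hull_cut_points(values, labels, positive="p"):
    """Feature values at the interior hull vertices, in ascending order."""
    hull = upper_hull(roc_points(values, labels, positive))
    # The end points (0, 0) and (1, 1) are not cut-points.
    return sorted(v for _, _, v in hull[1:-1])


# Hypothetical data, not the data set of Table 4.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = ["n", "n", "p", "n", "n", "p", "p", "p"]
print(convex_hull_cut_points(values, labels))
```

For this hypothetical data the script prints [3, 6], which splits the feature into the intervals below 3, from 3 up to but not including 6, and 6 or above. Neither cut separates two consecutive instances of the same class.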
It has been proven that the MAD2C method finds the cut-points that yield the maximum AUC. It has also been shown that the cut-points found by MAD2C never separate two consecutive instances of the same class, which is an important property expected from a discretization method. The MAD2C method is preferred over other discretization methods because empirical evaluations have shown that it has a lower running time on two-class data sets and helps classification algorithms achieve a higher performance, in terms of AUC, than other well-known discretization methods [35].
4.2.2 A Toy Data Set Discretization Example
To visualize the discretization process using the MAD2C method, a toy data set is given in Table 4. After the sorting operation, the ROC points are formed. The ROC graph for the example in Table 4 is given in Fig. 3. In this example, the higher F1 values are indicators of the p class. If the lower values of feature F1 were indicators of the p class, then the p and n class labels would be swapped and the ROC graph below the diagonal would be obtained. Note that these two ROC graphs are symmetric about the diagonal.
The first pass of the MAD2C algorithm is shown in Fig. 4. All points below or on the diagonal are ignored since they have no positive effect on the maximization of the AUC. Then the points causing concavities are eliminated. For this example, MAD2C converges to the convex hull in one pass, as shown in Fig. 4. The points left on the graph are reported as the discretization cut-points.
The authors are with the Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey.
E-mail: email@example.com, firstname.lastname@example.org.
Manuscript received 26 Feb. 2012; revised 15 Oct. 2012; accepted 17 Oct. 2012; published online 23 Oct. 2012.
Recommended for acceptance by B.C. Ooi.
For information on obtaining reprints of this article, please send e-mail to: email@example.com, and reference IEEECS Log Number TKDE-2012-02-0128.
Digital Object Identifier no. 10.1109/TKDE.2012.214.
H. Altay Güvenir received the BS and MS degrees in electronics and communications engineering from Istanbul Technical University, in 1979 and 1981, respectively, and the PhD degree in computer engineering and science from Case Western Reserve University in 1987. He joined the Department of Computer Engineering at Bilkent University in 1987. He is a professor and has served as the chairman of the Department of Computer Engineering since 2001. His research interests include artificial intelligence, machine learning, data mining, and intelligent data analysis. He is a member of the IEEE and the ACM.
Murat Kurtcephe received his first MSc degree in computer science from Bilkent University, Ankara, Turkey, where he worked on machine learning and data mining, especially on discretization and risk estimation methods, with Prof. H. Altay Güvenir. He received his second MSc degree from the Computer Science Department at Case Western Reserve University, where he worked with Professor Meral Ozsoyoglu on query visualization and pedigree data querying. Currently, he is working at NorthPoint Solutions, New York, focusing on regulatory compliance reporting applications.