Issue No. 03, March 2013 (vol. 25)
pp: 589-602
Shu Wu, Chinese Academy of Sciences, Beijing
Shengrui Wang, University of Sherbrooke, Sherbrooke
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.261
ABSTRACT
Outlier detection can usually be considered as a preprocessing step for locating, in a data set, those objects that do not conform to well-defined notions of expected behavior. It is very important in data mining for discovering novel or rare events, anomalies, vicious actions, exceptional phenomena, etc. We are investigating outlier detection for categorical data sets. This problem is especially challenging because of the difficulty of defining a meaningful similarity measure for categorical data. In this paper, we propose a formal definition of outliers and an optimization model of outlier detection, via a new concept of holoentropy that takes both entropy and total correlation into consideration. Based on this model, we define a function for the outlier factor of an object which is solely determined by the object itself and can be updated efficiently. We propose two practical 1-parameter outlier detection methods, named ITB-SS and ITB-SP, which require no user-defined parameters for deciding whether an object is an outlier. Users need only provide the number of outliers they want to detect. Experimental results show that ITB-SS and ITB-SP are more effective and efficient than mainstream methods and can be used to deal with both large and high-dimensional data sets where existing algorithms fail.
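As a rough illustration of the idea in the abstract, the sketch below computes entropy and total correlation for a small categorical data set and uses their sum as a stand-in for holoentropy. The abstract does not give the formula, so this unweighted sum is an assumption (the index terms mention attribute weighting, which is omitted here). Because total correlation equals the sum of per-attribute entropies minus the joint entropy, the sum reduces to the sum of per-attribute entropies. The function names and the naive greedy loop are hypothetical and are not the paper's methods; in particular, the loop recomputes everything at each step instead of using an efficiently updatable outlier factor.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def total_correlation(rows):
    """Sum of per-attribute entropies minus the joint entropy of the rows."""
    columns = list(zip(*rows))
    joint = entropy([tuple(r) for r in rows])
    return sum(entropy(col) for col in columns) - joint


def holoentropy(rows):
    """Assumed form: joint entropy + total correlation.

    This collapses to the sum of per-attribute entropies.
    """
    if not rows:
        return 0.0
    return sum(entropy(col) for col in zip(*rows))


def greedy_outliers(rows, k):
    """Naive greedy sketch: k times, remove the object whose removal
    leaves the remaining data with the lowest holoentropy."""
    remaining = list(range(len(rows)))
    outliers = []
    for _ in range(k):
        best = min(
            remaining,
            key=lambda i: holoentropy([rows[j] for j in remaining if j != i]),
        )
        outliers.append(best)
        remaining.remove(best)
    return outliers


if __name__ == "__main__":
    # Five identical objects and one that disagrees on both attributes.
    data = [("a", "x")] * 5 + [("b", "y")]
    print(greedy_outliers(data, 1))
```

As in the abstract, the only input besides the data is the number of outliers requested (`k`); no threshold is needed to decide whether an individual object is an outlier.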
INDEX TERMS
Information retrieval, Search methods, Mutual information, Greedy algorithms, Complexity theory, Outlier detection, holoentropy, total correlation, outlier factor, attribute weighting
CITATION
Shu Wu, Shengrui Wang, "Information-Theoretic Outlier Detection for Large-Scale Categorical Data", IEEE Transactions on Knowledge & Data Engineering, vol. 25, no. 3, pp. 589-602, March 2013, doi:10.1109/TKDE.2011.261