Scoring Levels of Categorical Variables with Heterogeneous Data
March/April 2004 (vol. 19 no. 2)
pp. 14-19
Eugene Tuv, Intel
George C. Runger, Arizona State University

Data modeling in industry is often challenged by the complexity of massive, heterogeneous (mixed-type) data sets. Furthermore, categorical variables can have hundreds or even thousands of levels (values). This work explores an efficient procedure for enriching an original data set (of practically any complexity) by assigning numerical scores to the levels of categorical variables. A novel scoring objective attempts to preserve the mutual information among all the variables. A nontraditional clustering approach for mixed data (supervised-contrasting-independence clustering) is used. Although the preprocessing developed here was primarily motivated by the need for low-dimensional, exploratory visualization of complex, heterogeneous data, the proposed, relatively simple preprocessing scheme can be useful in any distance-based learning problem. Two examples demonstrate this approach in instance-based, supervised applications.
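
To make the idea of scoring levels concrete, the following minimal sketch replaces each level of a categorical variable with a number so that ordinary Euclidean distance can be computed on mixed data. It is an illustration only and not the authors' procedure: the abstract does not specify the algorithm, so the sketch swaps in a much simpler stand-in (scoring each level by the mean of a numeric variable observed with it) in place of the mutual-information-preserving objective and the clustering step. The function name `score_levels` and the toy `tool` and `thickness` data are hypothetical.

```python
from collections import defaultdict
from math import dist


def score_levels(levels, values):
    """Map each categorical level to the mean of a numeric variable seen with it.

    Assumption: this per-level mean is a simple stand-in for illustration,
    not the scoring objective proposed in the article.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for level, value in zip(levels, values):
        totals[level] += value
        counts[level] += 1
    return {level: totals[level] / counts[level] for level in totals}


# Hypothetical toy data: one categorical variable and one numeric variable.
tool = ["A", "B", "A", "C", "B", "C"]
thickness = [1.0, 2.0, 1.2, 3.1, 2.2, 2.9]

scores = score_levels(tool, thickness)

# Each row becomes a fully numeric point: (score of its level, numeric value).
rows = [(scores[t], x) for t, x in zip(tool, thickness)]

# Distance-based methods can now compare rows directly.
d_same = dist(rows[0], rows[2])  # both rows have level "A"
d_diff = dist(rows[0], rows[3])  # level "A" versus level "C"
print(scores, d_same, d_diff)
```

Once every level carries a score, any distance-based learner (for example, a nearest-neighbor classifier) can treat the formerly categorical column like any other numeric coordinate.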

Index Terms:
data preprocessing, quantification, mixed-type data, clustering
Citation:
Eugene Tuv, George C. Runger, "Scoring Levels of Categorical Variables with Heterogeneous Data," IEEE Intelligent Systems, vol. 19, no. 2, pp. 14-19, March-April 2004, doi:10.1109/MIS.2004.1274906