Unsupervised Learning with Mixed Numeric and Nominal Data
July/August 2002 (vol. 14 no. 4)
pp. 673-690

This paper presents a Similarity-Based Agglomerative Clustering (SBAC) algorithm that works well for data with mixed numeric and nominal features. The similarity between pairs of objects is defined using a measure that Goodall proposed for biological taxonomy; it gives greater weight to matches on uncommon feature values and makes no assumptions about the underlying distributions of the feature values. An agglomerative algorithm constructs a dendrogram, and a simple distinctness heuristic extracts a partition of the data. The performance of SBAC has been studied on real and artificially generated data sets. The results demonstrate the effectiveness of the algorithm in unsupervised discovery tasks, and comparisons with other clustering schemes illustrate its superior performance.
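
The two ideas named in the abstract, weighting matches on uncommon values more heavily and merging objects bottom-up into a dendrogram, can be illustrated with a minimal Python sketch. It is not the SBAC algorithm itself. The nominal score 1 - p^2 (where p is the relative frequency of the matched value) is a simplified Goodall-style weighting, the range-normalized numeric score is a stand-in assumption, the group-average merge rule is one common choice, and all function names are hypothetical. The distinctness heuristic for cutting the dendrogram is not specified in the abstract and is left out.

```python
from collections import Counter
from itertools import combinations


def feature_similarity(column, i, j, numeric):
    """Similarity of objects i and j on one feature, in [0, 1]."""
    a, b = column[i], column[j]
    if numeric:
        # Stand-in (assumption): closeness relative to the observed range.
        spread = max(column) - min(column)
        return 1.0 if spread == 0 else 1.0 - abs(a - b) / spread
    if a != b:
        return 0.0
    # Simplified Goodall-style weighting: a match on a rare value scores higher.
    p = Counter(column)[a] / len(column)
    return 1.0 - p * p


def similarity_matrix(rows, numeric_flags):
    """Pairwise similarity, averaged over features (an assumed combination rule)."""
    columns = list(zip(*rows))
    n = len(rows)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        total = sum(feature_similarity(col, i, j, num)
                    for col, num in zip(columns, numeric_flags))
        sim[i][j] = sim[j][i] = total / len(columns)
    return sim


def agglomerate(sim):
    """Group-average agglomeration; returns the merge list (a dendrogram)."""
    def avg(pair):
        a, b = pair
        return sum(sim[x][y] for x in a for y in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(len(sim))]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the highest average pairwise similarity.
        a, b = max(combinations(clusters, 2), key=avg)
        merges.append((a, b, avg((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return merges


if __name__ == "__main__":
    # Toy data (made up for illustration): one nominal and one numeric feature.
    rows = [("red", 1.0), ("red", 1.1), ("red", 1.2),
            ("red", 5.0), ("blue", 5.1), ("blue", 5.2)]
    sim = similarity_matrix(rows, numeric_flags=[False, True])
    for left, right, score in agglomerate(sim):
        print(left, right, round(score, 3))
```

In this toy data the two "blue" objects share the less common nominal value, so their match contributes more to the similarity than a match between two "red" objects does, and they are merged first.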

[1] H.W. Beck, T. Anwar, and S.B. Navathe, “A Conceptual Clustering Algorithm for Database Schema Design,” IEEE Trans. Knowledge and Data Eng., vol. 3, pp. 396-411, 1994.
[2] G. Biswas, J. Weinberg, and C. Li, “ITERATE: A Conceptual Clustering Method for Knowledge Discovery in Databases,” Artificial Intelligence in the Petroleum Industry, B. Braunschweig and R. Day, eds., pp. 111-139, 1995.
[3] G. Biswas, J. Weinberg, and D.H. Fisher, “ITERATE: A Conceptual Clustering Algorithm for Data Mining,” IEEE Trans. Systems, Man, and Cybernetics, vol. 28C, pp. 219-230, May 1998.
[4] P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman, “Autoclass: A Bayesian Classification System,” Proc. Fifth Int'l Conf. Machine Learning, June 1988.
[5] P. Cheeseman and J. Stutz, “Bayesian Classification (AutoClass): Theory and Results,” Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, pp. 61-83, 1996.
[6] A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” J. Royal Statistical Soc. Series B, vol. 39, no. 1, pp. 1-38, 1977.
[7] R. Duda, P. Hart, and D. Stork, Pattern Classification. New York: John Wiley & Sons, 2001.
[8] D.H. Fisher, “Knowledge Acquisition via Incremental Conceptual Clustering,” Machine Learning, no. 2, pp. 139-172, 1987.
[9] D.H. Fisher, “Iterative Optimization and Simplification of Hierarchical Clusterings,” J. Artificial Intelligence Research, vol. 4, pp. 147-180, 1996.
[10] D.H. Fisher and P. Langley, “Methods of Conceptual Clustering and Their Relation to Numeric Taxonomy,” Proc. Artificial Intelligence and Statistics, W. Gale, ed., 1986.
[11] R.A. Fisher, Statistical Methods for Research Workers. 13th ed. Edinburgh and London: Oliver and Boyd, 1963.
[12] W.J. Frawley, G. Piatetsky-Shapiro, and C.J. Matheus, “Knowledge Discovery in Databases: An Overview,” Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W.J. Frawley, eds., pp. 1-27, Menlo Park, Calif.: AAAI/MIT Press, 1991.
[13] J.H. Gennari, P. Langley, and D.H. Fisher, “Models of Incremental Concept Formation,” Artificial Intelligence, vol. 40, nos. 1-3, pp. 11-59, 1989.
[14] M. Gluck and J. Corter, “Information, Uncertainty, and the Utility of Categories,” Proc. Seventh Ann. Conf. Cognitive Soc., pp. 283-287, 1985.
[15] D.W. Goodall, “A New Similarity Index Based On Probability,” Biometrics, vol. 22, pp. 882-907, 1966.
[16] S.J. Hanson and M. Bauer, “Conceptual Clustering, Categorization, and Polymorphy,” Machine Learning, no. 3, pp. 343-372, 1989.
[17] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall, 1988.
[18] J.L. Kolodner, “Maintaining Organization in a Dynamic Long-Term Memory,” Cognitive Science, vol. 7, pp. 243-280, 1983.
[19] Workshop on Case-Based Reasoning. J.L. Kolodner, ed., San Mateo, Calif.: Morgan Kaufmann, 1988.
[20] H.O. Lancaster, “The Combining of Probabilities Arising from Data in Discrete Distributions,” Biometrika, vol. 36, pp. 370-382, 1949.
[21] C. Li and G. Biswas, “Knowledge-based Scientific Discovery from Geological Databases,” Proc. First Int'l Conf. Knowledge Discovery and Data Mining, pp. 204-209, Aug. 1995.
[22] K. McKusick and K. Thompson, “COBWEB/3: A Portable Implementation,” Technical Report FIA-90-6-18-2, NASA Ames Research Center, 1990.
[23] S. Minton, J. Carbonell, C. Knoblock, D. Kuokka, O. Etzioni, and Y. Gil, “Explanation Based Learning: A Problem Solving Perspective,” Artificial Intelligence, vol. 40, pp. 63-118, 1989.
[24] D. Ourston and R.J. Mooney, “Theory Refinement Combining Analytic and Empirical Methods,” Artificial Intelligence, vol. 66, pp. 311-344, 1994.
[25] Y. Reich, “Building and Improving Design Systems: A Machine Learning Approach,” PhD thesis, Dept. Civil Eng., Carnegie Mellon Univ., 1991.
[26] Y. Reich and S. Fenves, “The Formation and Use of Abstract Concepts in Design,” Concept Formation: Knowledge and Experience in Unsupervised Learning, D.H. Fisher, M.J. Pazzani, and P. Langley, eds., pp. 323-353, San Mateo, Calif.: Morgan Kaufmann, 1991.
[27] R.S. Michalski, J.G. Carbonell, and T.M. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, San Francisco, 1983.
[28] L. Talavera and J. Bejar, “Integrating Declarative Knowledge in Hierarchical Clustering Tasks,” Proc. Int'l Symp. Intelligent Data Analysis, pp. 211-222, 1999.
[29] K. Thompson and P. Langley, “Case Studies in the Use of Background Knowledge: Incremental Concept Formation,” Proc. AAAI-92 Workshop Constraining Learning with Prior Knowledge, pp. 60-68, 1992.
[30] K. Wagstaff and C. Cardie, “Clustering with Instance-Level Constraints,” Proc. 17th Int'l Conf. Machine Learning, pp. 1103-1110, June 2000.
[31] M. Zemankova, “Implementing Imprecision in Information Systems,” Information Science, vol. 37, pp. 107-141, 1985.

Index Terms:
Agglomerative clustering, conceptual clustering, feature weighting, interpretation, knowledge discovery, mixed numeric and nominal data, similarity measures.
Cen Li, Gautam Biswas, "Unsupervised Learning with Mixed Numeric and Nominal Data," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 4, pp. 673-690, July-Aug. 2002, doi:10.1109/TKDE.2002.1019208