Clustering Without a Metric
February 1991 (vol. 13 no. 2)
pp. 175-184

A methodology is described for clustering data without using a distance metric or similarity function. Instead, clusterings are optimized based on their intended function: the accurate prediction of properties of the data. The resulting clustering methodology is applicable, without further ad hoc assumptions or transformations of the data, (1) when features are heterogeneous (both discrete and continuous) and not combinable, (2) when some data points have missing feature values, and (3) when some features are irrelevant, i.e., have large variance but little correlation with other features. Further, it provides an integral measure of the quality of the resulting clustering. A clustering program, RIFFLE, has been implemented in line with this approach, and experiments with synthetic and real data show that the clustering is, in many respects, superior to traditional methods.
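To make the idea of scoring a clustering by its predictive value concrete, the following is a minimal sketch and not RIFFLE itself; the abstract does not specify RIFFLE's objective or search procedure. It assumes discrete features only and uses Goodman and Kruskal's lambda (proportional reduction in prediction error) as a stand-in quality measure. The names `lambda_score`, `clustering_quality`, and the `MISSING` marker are hypothetical, chosen for illustration.

```python
from collections import Counter

MISSING = None  # hypothetical marker for a missing feature value


def lambda_score(labels, values):
    """Goodman-Kruskal lambda for one discrete feature.

    Measures the proportional reduction in prediction error when the
    feature value is guessed from the cluster label instead of from the
    overall mode. Records with a missing value are skipped.
    """
    pairs = [(c, v) for c, v in zip(labels, values) if v is not MISSING]
    if not pairs:
        return 0.0
    overall = Counter(v for _, v in pairs)
    errors_without = len(pairs) - max(overall.values())
    if errors_without == 0:
        return 0.0  # constant feature: nothing left to predict
    per_cluster = {}
    for c, v in pairs:
        per_cluster.setdefault(c, Counter())[v] += 1
    errors_with = sum(
        sum(counts.values()) - max(counts.values())
        for counts in per_cluster.values()
    )
    return (errors_without - errors_with) / errors_without


def clustering_quality(labels, rows):
    """Mean lambda over all features (an assumed aggregate, for illustration)."""
    n_features = len(rows[0])
    scores = [
        lambda_score(labels, [row[j] for row in rows])
        for j in range(n_features)
    ]
    return sum(scores) / n_features


# Toy data: two informative features (one with a missing value)
# and a third feature unrelated to the other two.
rows = [
    ("red", "small", "x"),
    ("red", "small", "y"),
    ("red", MISSING, "y"),
    ("blue", "large", "x"),
    ("blue", "large", "x"),
    ("blue", "large", "y"),
]
good = [0, 0, 0, 1, 1, 1]  # follows the color/size structure
bad = [0, 1, 0, 1, 0, 1]   # ignores it

quality_good = clustering_quality(good, rows)
quality_bad = clustering_quality(bad, rows)
```

On this toy data the partition that follows the shared structure of the first two features scores higher than the alternating one, no distance between records is ever computed, and the record with a missing value needs no imputation. Continuous features would need a different per-feature predictor, which this sketch does not attempt.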

Index Terms:
statistical analysis; pattern recognition; optimisation; clustering; distance metric; similarity function; RIFFLE
G. Matthews, J. Hearne, "Clustering Without a Metric," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 2, pp. 175-184, Feb. 1991, doi:10.1109/34.67646