In this paper we show that most hierarchical agglomerative clustering (HAC)algorithms follow a 90-10 rule where roughly 90%iterations from the beginning merge cluster pairs with dissimilarity less than 10%of the maximum dissimilarity. We propose two algorithms - 2-phase and nested - based on partially overlapping partitioning (POP).To handle high-dimensional data efficiently, we propose a tree structure particularly suitable for POP. Extensive experiments show that the proposed algorithms reduce the time and memory requirement of existing HAC algorithms significantly without compromising in accuracy.
Citation:
Manoranjan Dash, Kian Lee Tan, Huan Liu, "Efficient Yet Accurate Clustering," icdm, pp.99, First IEEE International Conference on Data Mining (ICDM'01), 2001