Clustering Uncertain Data Based on Probability Distribution Similarity
PrePrint
ISSN: 1041-4347
Bin Jiang, Simon Fraser University, Burnaby
Jian Pei, Simon Fraser University, Burnaby
Yufei Tao, Chinese University of Hong Kong, Hong Kong
Xuemin Lin, The University of New South Wales, Sydney and East China Normal University, China
Clustering on uncertain data, one of the essential tasks in mining uncertain data, poses significant challenges for both modeling similarity between uncertain objects and developing efficient computational methods. Previous methods extend traditional partitioning clustering methods like k-means and density-based clustering methods like DBSCAN to uncertain data, and thus rely on geometric distances between objects. Such methods cannot handle uncertain objects that are geometrically indistinguishable, such as products with the same mean but very different variances in customer ratings. Surprisingly, probability distributions, which are essential characteristics of uncertain objects, have not been considered in measuring similarity between uncertain objects. In this paper, we systematically model uncertain objects in both continuous and discrete domains, where an uncertain object is modeled as a continuous and discrete random variable, respectively. We use the well-known Kullback-Leibler (KL) divergence to measure similarity between uncertain objects in both the continuous and discrete cases, and integrate it into partitioning and density-based clustering methods to cluster uncertain objects. To tackle the efficiency problem, we estimate KL divergence in the continuous case by kernel density estimation and employ the fast Gauss transform technique to further speed up the computation. Our extensive experimental results verify the effectiveness, efficiency, and scalability of our approaches.
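To illustrate the motivating example in the abstract, the following is a minimal sketch (not the authors' implementation) of KL divergence between two discrete uncertain objects. The rating distributions `product_a` and `product_b`, the helper names, and the `eps` smoothing constant are all hypothetical choices made for this illustration. Both products have the same mean rating, so a distance based only on the mean cannot separate them, while the KL divergence between their distributions is clearly nonzero.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """Return D(p || q) for two discrete distributions over the same domain.

    eps is an assumption of this sketch: a small constant added to every
    probability (followed by renormalisation) so that zero entries do not
    produce log(0) or division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same domain")
    ps = [x + eps for x in p]
    qs = [x + eps for x in q]
    sum_p, sum_q = sum(ps), sum(qs)
    return sum(
        (a / sum_p) * math.log((a / sum_p) / (b / sum_q))
        for a, b in zip(ps, qs)
    )


def mean(values, probs):
    """Expected value of a discrete random variable."""
    return sum(v * w for v, w in zip(values, probs))


# Hypothetical customer-rating distributions over the ratings 1..5.
ratings = [1, 2, 3, 4, 5]
product_a = [0.05, 0.10, 0.70, 0.10, 0.05]  # ratings concentrated around 3
product_b = [0.40, 0.05, 0.10, 0.05, 0.40]  # polarised ratings

if __name__ == "__main__":
    print("mean A:", mean(ratings, product_a))
    print("mean B:", mean(ratings, product_b))
    print("KL(A || B):", kl_divergence(product_a, product_b))
    print("KL(B || A):", kl_divergence(product_b, product_a))
```

Note that KL divergence is asymmetric, so `kl_divergence(product_a, product_b)` and `kl_divergence(product_b, product_a)` generally differ; a clustering method built on it has to fix a direction or otherwise account for this.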
Index Terms:
Clustering, Uncertainty, "fuzzy", and probabilistic reasoning
Citation:
Bin Jiang, Jian Pei, Yufei Tao, Xuemin Lin, "Clustering Uncertain Data Based on Probability Distribution Similarity," IEEE Transactions on Knowledge and Data Engineering, 14 Oct. 2011. IEEE Computer Society Digital Library. IEEE Computer Society, <http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.221>