loading...
 This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
2009 Ninth IEEE International Conference on Data Mining
Scalable Algorithms for Distribution Search
Miami, Florida
December 06-December 09
ISBN: 978-0-7695-3895-2
Distribution data naturally arise in countless domains, such as meteorology, biology, geology, industry and economics. However, relatively little attention has been paid to data mining for large distribution sets. Given n distributions of multiple categories and a query distribution Q, we want to find similar clouds (i.e., distributions), to discover patterns, rules and outlier clouds. For example, consider the numerical case of sales of items, where, for each item sold, we record the unit price and quantity; then, each customer is represented as a distribution of 2-d points (one for each item he/she bought). We want to find similar users, e.g., for market segmentation, anomaly/fraud detection. We propose to address this problem and present D-Search, which includes fast and effective algorithms for similarity search in large distribution datasets. Our main contributions are (1) approximate KL divergence, which can speed up cloud-similarity computations, (2) multi-step sequential scan, which efficiently prunes a significant number of search candidates and leads to a direct reduction in the search cost. We also introduce an extended version of D-Search: (3) time-series distribution mining, which finds similar subsequences in time-series distribution datasets. Extensive experiments on real multi-dimensional datasets show that our solution achieves up to 2,300 faster wall-clock time over the naive implementation while it does not sacrifice accuracy.
Index Terms:
Distribution sets, KL divergence, Singular value decomposition
Citation:
Yasuko Matsubara, Yasushi Sakurai, Masatoshi Yoshikawa, "Scalable Algorithms for Distribution Search," icdm, pp.347-356, 2009 Ninth IEEE International Conference on Data Mining, 2009
Usage of this product signifies your acceptance of the Terms of Use.