A Tree-Based Framework for Difference Summarization
2009 Ninth IEEE International Conference on Data Mining
Miami, Florida, December 6-9, 2009
ISBN: 978-0-7695-3895-2
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDM.2009.68
Understanding the differences between two datasets is a fundamental data mining question and is important across many real-world scientific applications. In this paper, we propose a tree-based framework that provides a parsimonious explanation of the difference between two distributions, based on a rigorous two-sample statistical test. We develop two efficient approaches. The first is a dynamic programming approach that finds a minimal number of data subsets describing the difference between the two datasets. The second is a greedy approach that approximates the dynamic programming approach. We employ the well-known Friedman's MST (minimal spanning tree) statistic for two-sample statistical tests in our summarization tree construction, and develop novel techniques to speed up its computation. We performed a detailed experimental evaluation on both real and synthetic datasets and demonstrated the effectiveness of our tree-summarization approach.
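For readers unfamiliar with the MST-based two-sample statistic mentioned above, the following is a minimal sketch of the basic Friedman-Rafsky runs count: pool both samples, build a Euclidean minimal spanning tree, and count the edges that join points from different samples. It is not the paper's summarization-tree algorithm and does not include its speedup techniques. The function names (`mst_edges`, `friedman_rafsky_runs`, `expected_runs`) and the naive O(N^2) Prim's algorithm are assumptions chosen for illustration.

```python
import math


def mst_edges(points):
    """Return the MST edges (i, j) of the complete Euclidean graph on `points`.

    Naive O(N^2) Prim's algorithm; illustrative only, not an optimized method.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from the tree to each point
    parent = [-1] * n       # tree endpoint that realizes that distance
    edges = []
    if n == 0:
        return edges
    best[0] = 0.0
    for _ in range(n):
        # Pick the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # Relax distances from the newly added point.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def friedman_rafsky_runs(sample_a, sample_b):
    """Runs statistic R = 1 + number of MST edges joining the two samples."""
    pooled = list(sample_a) + list(sample_b)
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    cross = sum(1 for i, j in mst_edges(pooled) if labels[i] != labels[j])
    return cross + 1


def expected_runs(m, n):
    """Mean of R under the null hypothesis that both samples share a distribution."""
    return 2.0 * m * n / (m + n) + 1.0


# Hypothetical toy data: two well-separated clusters give very few runs.
cluster_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
cluster_b = [(10, 10), (10, 11), (11, 10), (11, 11)]
runs = friedman_rafsky_runs(cluster_a, cluster_b)
print(runs, expected_runs(len(cluster_a), len(cluster_b)))
```

A value of R well below its null expectation suggests that the two samples differ. A full test would also need the variance of R (or a permutation of the labels) to obtain a significance level, which this sketch omits.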
Index Terms:
difference summarization, minimal spanning tree, two-sample test, Friedman-Rafsky test, Chi-square test, Kolmogorov-Smirnov test
Citation:
Ruoming Jin, Yuri Breitbart, Rong Li, "A Tree-Based Framework for Difference Summarization," icdm, pp.209-218, 2009 Ninth IEEE International Conference on Data Mining, 2009