July 3, 2006 to July 5, 2006
Rong Ge, Simon Fraser University
Martin Ester, Simon Fraser University
Wen Jin, Simon Fraser University
Zengjian Hu, Simon Fraser University
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/SSDBM.2006.6
Data summarization has been recognized as a fundamental operation in database systems and data mining, with important applications such as data compression and privacy preservation. While existing methods such as CF-values and Data Bubbles may perform reasonably well, they cannot provide any guarantees on the quality of their results. In this paper, we introduce a summarization approach for numerical data based on discs, which formalizes the notion of quality. Our objective is to find a minimal set of discs, i.e., spheres satisfying a radius and a significance constraint, that covers the given dataset. Since the proposed problem is NP-complete, we design two different approximation algorithms. These algorithms have a quality guarantee, but they do not scale well to large databases. However, the machinery from approximation algorithms allows a precise characterization of a further, heuristic algorithm. This efficient heuristic algorithm exploits multi-dimensional index structures and can be well integrated with database systems. The experiments show that our heuristic algorithm generates summaries that outperform the state-of-the-art Data Bubbles in terms of internal measures, as well as in terms of external measures when the data summaries are used as input for clustering methods.
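To make the covering objective concrete, the following is a minimal sketch of a generic greedy set-cover heuristic applied to discs. It is not one of the algorithms from the paper. It assumes that candidate discs are centered at data points, that the radius constraint is a fixed maximum radius, and that the significance constraint is a minimum number of points per disc. The names greedy_disc_cover, radius, and min_points are hypothetical.

```python
from math import dist


def greedy_disc_cover(points, radius, min_points):
    """Greedy set-cover sketch: pick discs until no more points can be covered.

    Assumptions of this sketch (not taken from the paper):
    - candidate discs are centered at the data points themselves;
    - a disc is "significant" if it contains at least min_points points.
    Returns (indices of chosen centers, indices of points left uncovered).
    """
    # Build the candidate discs that satisfy the assumed significance constraint.
    candidates = {}
    for i, center in enumerate(points):
        members = {j for j, p in enumerate(points) if dist(center, p) <= radius}
        if len(members) >= min_points:
            candidates[i] = members

    uncovered = set(range(len(points)))
    chosen = []
    while uncovered:
        # Pick the candidate disc that covers the most still-uncovered points.
        best = max(
            candidates,
            key=lambda i: len(candidates[i] & uncovered),
            default=None,
        )
        if best is None or not candidates[best] & uncovered:
            break  # the remaining points lie in no significant disc
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen, uncovered
```

For example, with points [(0, 0), (0.5, 0), (0, 0.5), (10, 10), (10.5, 10), (50, 50)], radius 1.0, and min_points 2, the sketch selects the discs centered at points 0 and 3 and leaves the isolated point 5 uncovered. Candidate construction here is quadratic in the number of points, so a multi-dimensional index would be needed for anything beyond small inputs.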
Rong Ge, Martin Ester, Wen Jin, Zengjian Hu, "A Disc-based Approach to Data Summarization and Privacy Preservation", Scientific and Statistical Database Management, International Conference on (SSDBM), 2006, pp. 321-332, doi:10.1109/SSDBM.2006.6