22nd International Conference on Data Engineering (ICDE'06)
Techniques for Warehousing of Sample Data
Atlanta, Georgia
April 03-April 07
ISBN: 0-7695-2570-9
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDE.2006.157
We consider the problem of maintaining a warehouse of sampled data that "shadows" a full-scale data warehouse, in order to support quick approximate analytics and metadata discovery. The full-scale warehouse comprises many "data sets," where a data set is a bag of values; the data sets can vary enormously in size. The values constituting a data set can arrive in batch or stream form. We provide and compare several new algorithms for independent and parallel uniform random sampling of data-set partitions, where the partitions are created by dividing the batch or splitting the stream. We also provide novel methods for merging samples to create a uniform sample from an arbitrary union of data-set partitions. Our sampling/merge methods are the first to simultaneously support statistical uniformity, a priori bounds on the sample footprint, and concise sample storage. As partitions are rolled in and out of the warehouse, the corresponding samples are rolled in and out of the sample warehouse. In this manner our sampling methods approximate the behavior of more sophisticated stream-sampling methods, while also supporting parallel processing. Experiments indicate that our methods are efficient and scalable, and provide guidance for their application.
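As a rough illustration of the sample-then-merge idea described in the abstract, the following Python sketch draws a bounded-size uniform sample from each partition independently and then combines two such samples into a uniform sample of the union. It is not the authors' algorithm: it uses textbook reservoir sampling and a hypergeometric split, it assumes the partitions are disjoint and their sizes are known once sampled, and it ignores the concise-storage aspect entirely. All function names (`reservoir_sample`, `hypergeometric_draw`, `merge_samples`) are hypothetical.

```python
import random


def reservoir_sample(values, k, rng):
    """Return (sample, n): a uniform sample of at most k items from one
    partition, plus the partition size n. Memory use is bounded by k."""
    sample = []
    n = 0
    for v in values:
        n += 1
        if len(sample) < k:
            sample.append(v)
        else:
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                sample[j] = v
    return sample, n


def hypergeometric_draw(n1, n2, k, rng):
    """Simulate how many of k items drawn without replacement from a
    population of n1 + n2 items come from the first n1 items."""
    from_first = 0
    for _ in range(k):
        if rng.random() * (n1 + n2) < n1:
            from_first += 1
            n1 -= 1
        else:
            n2 -= 1
    return from_first


def merge_samples(s1, n1, s2, n2, k, rng):
    """Combine uniform samples s1, s2 of two disjoint partitions (sizes n1, n2)
    into a uniform sample of size k from their union.

    Simplifying assumption: k may not exceed the size of either input sample."""
    if k > min(len(s1), len(s2)):
        raise ValueError("k must not exceed the size of either input sample")
    from_first = hypergeometric_draw(n1, n2, k, rng)
    return rng.sample(s1, from_first) + rng.sample(s2, k - from_first)
```

Because each partition is sampled on its own, the calls to `reservoir_sample` could run in parallel, and `merge_samples` only needs the stored samples and partition sizes, not the original data. The hypergeometric step decides how many merged items each partition contributes, which is what keeps the result uniform over the union rather than biased toward the smaller partition.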
Citation:
Paul G. Brown, Peter J. Haas, "Techniques for Warehousing of Sample Data," icde, pp.6, 22nd International Conference on Data Engineering (ICDE'06), 2006