
George Kollios, Dimitrios Gunopulos, Nick Koudas, Stefan Berchtold, "Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 5, pp. 1170-1187, September/October 2003.  
DOI: 10.1109/TKDE.2003.1232271  
ISSN: 1041-4347. Publisher: IEEE Computer Society, Los Alamitos, CA, USA.  
Index Terms: Data mining, sampling, clustering, outlier detection.  
Abstract: We investigate the use of sampling that is biased according to the density of the data set to speed up general data mining tasks, such as clustering and outlier detection, in large multidimensional data sets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the data set. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, evaluate it analytically, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling to real and synthetic data sets and employing clustering and outlier detection algorithms, thus highlighting the utility of our approach.
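To make the idea of density-dependent inclusion probabilities concrete, the following is a minimal sketch and not the authors' implementation. It assumes a Gaussian kernel density estimate and assumes that each point is kept with probability proportional to its estimated density raised to a tunable exponent; the function names, the `exponent`, `expected_size`, and `bandwidth` parameters, and the quadratic-time density estimate are all hypothetical choices made for illustration.

```python
import math
import random


def local_density(points, bandwidth):
    """Gaussian kernel density estimate at every point (O(n^2), illustration only)."""
    densities = []
    for p in points:
        total = 0.0
        for q in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        densities.append(total / len(points))
    return densities


def inclusion_probabilities(densities, exponent, expected_size):
    """Probability of keeping each point, proportional to density ** exponent.

    Assumed biasing rule: exponent = 0 gives uniform sampling, exponent > 0
    favors dense regions, exponent < 0 favors sparse regions. Weights are
    scaled to sum to expected_size and then capped at 1.
    """
    weights = [d ** exponent for d in densities]
    scale = expected_size / sum(weights)
    return [min(1.0, w * scale) for w in weights]


def density_biased_sample(points, exponent, expected_size, bandwidth=1.0, seed=0):
    """Keep each point independently with its density-dependent probability."""
    rng = random.Random(seed)
    densities = local_density(points, bandwidth)
    probs = inclusion_probabilities(densities, exponent, expected_size)
    return [p for p, pr in zip(points, probs) if rng.random() < pr]
```

Under these assumptions, a positive exponent would suit a task such as clustering, where dense regions matter most, while a negative exponent oversamples sparse regions, which is where outliers tend to lie. For example, `density_biased_sample(points, exponent=-1.0, expected_size=100)` would draw a sample of roughly 100 points skewed toward isolated points.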