Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets
September/October 2003 (vol. 15 no. 5)
pp. 1170-1187

Abstract: We investigate density-biased sampling as a way to speed up general data mining tasks, such as clustering and outlier detection, in large multidimensional data sets. In density-biased sampling, the probability that a given point is included in the sample depends on the local density of the data set around that point. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, evaluate it analytically, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling to real and synthetic data sets and employing clustering and outlier detection algorithms, thus highlighting the utility of our approach.
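
To make the idea of a density-dependent inclusion probability concrete, here is a minimal Python sketch. It is not the method described in the article: it assumes a simple fixed grid as the density estimate and an inclusion probability proportional to the local density raised to a tunable exponent. The names `cell_of`, `inclusion_probabilities`, `biased_sample`, `cell_size`, `exponent`, and `expected_size` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def cell_of(point, cell_size):
    """Map a point to the index of the grid cell that contains it."""
    return tuple(int(coord // cell_size) for coord in point)


def inclusion_probabilities(points, cell_size, exponent, expected_size):
    """Return one inclusion probability per point.

    Assumption of this sketch: local density is approximated by the number
    of points that share a grid cell, and each probability is proportional
    to density ** exponent, scaled so the probabilities sum to
    expected_size (before capping at 1.0).
    """
    counts = Counter(cell_of(p, cell_size) for p in points)
    weights = [counts[cell_of(p, cell_size)] ** exponent for p in points]
    scale = expected_size / sum(weights)
    # Capping at 1.0 can make the actual expected size slightly smaller.
    return [min(1.0, w * scale) for w in weights]


def biased_sample(points, cell_size, exponent, expected_size, rng):
    """Include each point independently with its density-dependent probability."""
    probs = inclusion_probabilities(points, cell_size, exponent, expected_size)
    return [p for p, prob in zip(points, probs) if rng.random() < prob]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data: one tight cluster plus a few scattered points.
    dense = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(2000)]
    scattered = [(rng.uniform(5.0, 15.0), rng.uniform(5.0, 15.0)) for _ in range(50)]
    points = dense + scattered
    for exponent in (1.0, 0.0, -1.0):
        sample = biased_sample(points, 1.0, exponent, 100, rng)
        kept_scattered = sum(1 for p in sample if p[0] > 4.0)
        print(exponent, len(sample), kept_scattered)
```

In this sketch an exponent of 0 reduces to uniform random sampling, a positive exponent favors points in dense regions, and a negative exponent favors points in sparse regions, which is one way a sample could be tuned toward clustering or toward outlier detection.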

Index Terms:
Data mining, sampling, clustering, outlier detection.
Citation:
George Kollios, Dimitrios Gunopulos, Nick Koudas, Stefan Berchtold, "Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 5, pp. 1170-1187, Sept.-Oct. 2003, doi:10.1109/TKDE.2003.1232271