A Survey of Uncertain Data Algorithms and Applications
May 2009 (vol. 21 no. 5)
pp. 609-623
Charu C. Aggarwal, IBM T. J. Watson Research Center, Hawthorne
Philip S. Yu, IBM T. J. Watson Research Center, Hawthorne
In recent years, a number of indirect data collection methodologies have led to the proliferation of uncertain data. Such data points are often represented in the form of a probabilistic function, since the corresponding deterministic value is not known. This increases the challenge of mining and managing uncertain data, since the precise behavior of the underlying data is no longer known. In this paper, we provide a survey of uncertain data mining and management applications. In the field of uncertain data management, we will examine traditional methods such as join processing, query processing, selectivity estimation, OLAP queries, and indexing. In the field of uncertain data mining, we will examine traditional mining problems such as classification and clustering. We will also examine a general transform-based technique for mining uncertain data. We discuss the models for uncertain data and how they can be leveraged in a variety of applications. We also discuss different methodologies for processing and mining uncertain data in a variety of forms.
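To make the idea of a data point "represented in the form of a probabilistic function" concrete, the following minimal sketch models each uncertain value as a Gaussian and answers a range query with a probability threshold. It is an illustration only, not code from the paper: the Gaussian error model, the function names, and the sample sensor readings are all assumptions made for this example.

```python
import math

def normal_cdf(x, mean, std):
    """CDF of a normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def prob_in_range(mean, std, low, high):
    """Probability that an uncertain value N(mean, std^2) lies in [low, high]."""
    return normal_cdf(high, mean, std) - normal_cdf(low, mean, std)

def threshold_range_query(records, low, high, tau):
    """Return ids of records whose probability of lying in [low, high] is >= tau."""
    return [rid for rid, mean, std in records
            if prob_in_range(mean, std, low, high) >= tau]

# Hypothetical readings: (id, estimated value, standard deviation of the error).
readings = [("s1", 20.0, 0.5), ("s2", 22.5, 2.0), ("s3", 30.0, 1.0)]

# Which readings lie in [19, 23] with probability at least 0.5?
print(threshold_range_query(readings, 19.0, 23.0, 0.5))
```

Under these assumptions, a record is returned when enough of its probability mass falls inside the query range, rather than when a single recorded value does. That is the basic shift from deterministic to uncertain query processing that the abstract describes.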

Index Terms:
Mining methods and algorithms
Citation:
Charu C. Aggarwal, Philip S. Yu, "A Survey of Uncertain Data Algorithms and Applications," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 5, pp. 609-623, May 2009, doi:10.1109/TKDE.2008.190