Issue No.12 - December (2011 vol.23)
pp: 1903-1917
Jeffrey Jestes, Florida State University, Tallahassee
Graham Cormode, AT&T Labs-Research, Florham Park
Feifei Li, Florida State University, Tallahassee
Ke Yi, Hong Kong University of Science and Technology, Hong Kong
ABSTRACT
Recently, there have been several attempts to propose definitions and algorithms for ranking queries on probabilistic data. However, these lack many intuitive properties of a top-k query over deterministic data. We define several fundamental properties, including exact-k, containment, unique rank, value invariance, and stability, which are satisfied by ranking queries on certain data. We argue that these properties should also be carefully studied when defining ranking queries on probabilistic data, and that for most applications any definition for ranking uncertain data should fulfill them. We propose an intuitive new ranking definition based on the observation that the ranks of a tuple across all possible worlds represent a well-founded rank distribution. We study ranking definitions based on the expectation, the median, and other statistics of this rank distribution for a tuple, and derive the expected rank, median rank, and quantile rank, respectively. We prove that the expected rank, median rank, and quantile rank satisfy all of these properties for a ranking query. We provide efficient solutions to compute such rankings across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. Finally, a comprehensive experimental study confirms the effectiveness of our approach.
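The idea of a rank distribution over possible worlds can be illustrated with a small brute-force sketch. It is not the efficient algorithm the abstract refers to: it enumerates every possible world, which is exponential in the number of tuples. It assumes an attribute-level model in which each tuple has an independent, discrete score distribution, and it assumes that a tuple's rank in a world is the number of tuples with a strictly higher score (so the best tuple has rank 0). The function names, the tie-breaking by tuple name, and the sample data are all hypothetical.

```python
from itertools import product


def rank_distribution(tuples):
    """Map each tuple name to {rank: probability} over all possible worlds.

    `tuples` maps a name to a list of (score, probability) pairs.
    Assumption: scores of different tuples are independent, and a tuple's
    rank in a world is the number of tuples with a strictly higher score.
    """
    names = list(tuples)
    dist = {name: {} for name in names}
    # Each element of the product picks one (score, prob) pair per tuple,
    # i.e. it is one possible world.
    for choice in product(*(tuples[name] for name in names)):
        world_prob = 1.0
        for _, prob in choice:
            world_prob *= prob
        scores = {name: score for name, (score, _) in zip(names, choice)}
        for name in names:
            rank = sum(1 for other in names if scores[other] > scores[name])
            dist[name][rank] = dist[name].get(rank, 0.0) + world_prob
    return dist


def expected_rank(rank_dist):
    """Expectation of a {rank: probability} distribution."""
    return sum(rank * prob for rank, prob in rank_dist.items())


def quantile_rank(rank_dist, phi=0.5):
    """Smallest rank whose cumulative probability reaches phi (0.5 = median)."""
    cumulative = 0.0
    for rank in sorted(rank_dist):
        cumulative += rank_dist[rank]
        if cumulative >= phi:
            return rank
    return max(rank_dist)  # guards against floating-point shortfall


def top_k(tuples, k, statistic=expected_rank):
    """Return the k tuple names with the smallest value of `statistic`."""
    dist = rank_distribution(tuples)
    # Ties are broken by name here; this is an arbitrary illustrative choice.
    return sorted(dist, key=lambda name: (statistic(dist[name]), name))[:k]


# Hypothetical data: three tuples with uncertain scores.
data = {
    "t1": [(100, 0.4), (70, 0.6)],
    "t2": [(92, 0.6), (80, 0.4)],
    "t3": [(85, 1.0)],
}
print(top_k(data, 2))                 # ranked by expected rank
print(top_k(data, 2, quantile_rank))  # ranked by median rank
```

Because `top_k` sorts on a single number per tuple and slices the first k entries, it always returns exactly k distinct tuples whenever at least k exist, which is the kind of behavior the exact-k and unique rank properties ask for.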
INDEX TERMS
Probabilistic data, ranking queries, top-k queries, uncertain database.
CITATION
Jeffrey Jestes, Graham Cormode, Feifei Li, Ke Yi, "Semantics of Ranking Queries for Probabilistic Data", IEEE Transactions on Knowledge & Data Engineering, vol.23, no. 12, pp. 1903-1917, December 2011, doi:10.1109/TKDE.2010.192