
Issue No.09 - Sept. (2012 vol.24)

pp: 1640-1657

Fabrizio Angiulli, University of Calabria, Rende

Fabio Fassetti, University of Calabria, Rende

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.93

ABSTRACT

In this study, we address the problem of efficiently answering range queries over uncertain objects in a general metric space. Here, an uncertain object is an object that always exists but whose actual value is uncertain and is modeled by a multivariate probability density function. As a major contribution, this is the first work providing an effective technique for indexing uncertain objects coming from general metric spaces. We generalize the reverse triangle inequality to the probabilistic setting in order to exploit it as a discard condition. Then, we introduce a novel pivot-based indexing technique, called UP-index, and show how it can be employed to speed up range query computation. Importantly, the candidate selection phase of our technique noticeably reduces the set of candidates at little time cost. Finally, we provide a criterion to measure the quality of a set of pivots and study the problem of selecting a good set of pivots according to this criterion. We report some intractability results and then design an approximate algorithm with statistical guarantees for selecting pivots. Experimental results validate the effectiveness of the proposed approach and reveal that the introduced technique may even be preferable to indexing techniques specifically designed for the Euclidean space.
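As an illustration of the discard condition mentioned above, the following is a minimal sketch of classical pivot-based filtering for ordinary (certain) objects, where the reverse triangle inequality |d(q, p) - d(o, p)| <= d(q, o) lets a range query skip objects without computing d(q, o). It is not the UP-index and does not implement the probabilistic generalization; the function names, the choice of edit distance as the metric, and the table layout are assumptions made for the example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, a metric on strings (assumed here as the example metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def build_pivot_table(objects: Sequence[T], pivots: Sequence[T],
                      dist: Callable[[T, T], float]) -> List[List[float]]:
    """Precompute d(o, p) for every object o and pivot p (hypothetical index layout)."""
    return [[dist(o, p) for p in pivots] for o in objects]


def range_query(query: T, radius: float, objects: Sequence[T],
                pivots: Sequence[T], table: List[List[float]],
                dist: Callable[[T, T], float]) -> List[T]:
    """Return all objects within `radius` of `query`, filtering with pivots first."""
    q_to_pivots = [dist(query, p) for p in pivots]
    results = []
    for obj, row in zip(objects, table):
        # Reverse triangle inequality: |d(q,p) - d(o,p)| <= d(q,o).
        # If the lower bound already exceeds the radius, obj cannot qualify.
        if any(abs(qp - op) > radius for qp, op in zip(q_to_pivots, row)):
            continue
        # Candidate survived the filter: verify with the real distance.
        if dist(query, obj) <= radius:
            results.append(obj)
    return results
```

The filter only ever discards objects whose lower bound exceeds the radius, so the answer is the same as a linear scan; the saving comes from replacing expensive distance computations with lookups in the precomputed table.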

INDEX TERMS

Uncertain data, indexing, metric spaces

CITATION

Fabrizio Angiulli, Fabio Fassetti, "Indexing Uncertain Data in General Metric Spaces",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 24, no. 9, pp. 1640-1657, Sept. 2012, doi:10.1109/TKDE.2011.93