Issue No.03 - March (2014 vol.26)
pp: 725-738
Wei Cheng, University of North Carolina at Chapel Hill, Chapel Hill
Xiaoming Jin, Tsinghua University, Beijing
Jian-Tao Sun, Microsoft Research Asia, Beijing
Xuemin Lin, The University of New South Wales, Sydney
Xiang Zhang, Case Western Reserve University, Cleveland
Wei Wang, University of California, Los Angeles, Los Angeles
ABSTRACT
Similarity querying is a fundamental problem in database, data mining, and information retrieval research. Recently, querying incomplete data has attracted extensive attention because it poses new challenges to traditional querying techniques. Existing work on querying incomplete data addresses the case where the data values on certain dimensions are unknown. However, in many real-life applications, such as data collected by a sensor network in a noisy environment, not only the data values but also the dimension information may be missing. In this work, we investigate the problem of similarity search on dimension incomplete data. We develop a probabilistic framework to model this problem so that users can find objects in the database that are similar to the query with a probability guarantee. Missing dimension information poses a great computational challenge, since all possible combinations of missing dimensions need to be examined when evaluating the similarity between the query and the data objects. We derive lower and upper bounds on the probability that a data object is similar to the query. These bounds enable efficient filtering of irrelevant data objects without explicitly examining all missing dimension combinations. A probability triangle inequality is also employed to further prune the search space and speed up query processing. The proposed probabilistic framework and techniques can be applied to both whole sequence and subsequence queries. Extensive experimental results on real-life data sets demonstrate the effectiveness and efficiency of our approach.
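To make the cost of examining every missing-dimension combination concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the bounds or the algorithm of the paper. It assumes that the observed values keep their original order, that every order-preserving placement of the observed values into the full dimensionality is equally likely, and that similarity means Euclidean distance within a radius. The function names `partial_distance` and `match_probability_upper_bound` are hypothetical. Because the missing dimensions can only add to a Euclidean distance, the fraction of placements whose partial distance is within the radius is an upper bound on the probability that the full object is within the radius.

```python
from itertools import combinations
from math import comb, sqrt


def partial_distance(query, observed, positions):
    """Euclidean distance restricted to the dimensions listed in `positions`."""
    return sqrt(sum((query[p] - v) ** 2 for p, v in zip(positions, observed)))


def match_probability_upper_bound(query, observed, radius):
    """Fraction of order-preserving placements with partial distance <= radius.

    Assumption (not from the source): all placements are equally likely.
    Missing dimensions contribute a non-negative amount to the distance, so
    this fraction upper-bounds the probability of a true match.
    """
    n, m = len(query), len(observed)
    hits = sum(
        1
        for positions in combinations(range(n), m)  # every candidate placement
        if partial_distance(query, observed, positions) <= radius
    )
    return hits / comb(n, m)


if __name__ == "__main__":
    query = [1.0, 2.0, 3.0, 4.0]
    observed = [1.0, 4.0]  # two of the four dimensions were lost
    print(match_probability_upper_bound(query, observed, radius=1.0))
```

With the query `[1.0, 2.0, 3.0, 4.0]`, the observed values `[1.0, 4.0]`, and a radius of 1.0, three of the six placements fall within the radius, so the sketch returns 0.5. The number of placements grows as the binomial coefficient of the full and observed dimensionalities, which is why bounds that avoid this enumeration matter.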
INDEX TERMS
Random variables, Upper bound, Probabilistic logic, Educational institutions, Query processing, Time series analysis, whole sequence query, Dimension incomplete database, similarity search
CITATION
Wei Cheng, Xiaoming Jin, Jian-Tao Sun, Xuemin Lin, Xiang Zhang, Wei Wang, "Searching Dimension Incomplete Databases", IEEE Transactions on Knowledge & Data Engineering, vol.26, no. 3, pp. 725-738, March 2014, doi:10.1109/TKDE.2013.14