Issue No.11 - November (2011 vol.23)

pp: 1718-1734

Xiang Lian, Hong Kong University of Science and Technology, Hong Kong

Lei Chen, Hong Kong University of Science and Technology, Hong Kong

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.208

ABSTRACT

Similarity join processing in streaming environments has many practical applications, such as sensor networks and object tracking and monitoring. Previous work usually assumes that stream processing is conducted over precise data. In this paper, we study the important problem of similarity join processing on stream data that inherently contain uncertainty (called uncertain data streams), where the incoming data at each time stamp are uncertain and imprecise. Specifically, we formalize this problem as join on uncertain data streams (USJ), which can guarantee the accuracy of USJ answers over uncertain data. To tackle the efficiency and effectiveness challenges, such as limited memory and the need for short response times, we propose effective pruning methods on both the object and sample levels to filter out false alarms. We integrate the proposed pruning methods into an efficient query procedure that can incrementally maintain the USJ answers. Most importantly, we further design a novel strategy, namely adaptive superset prejoin (ASP), to maintain a superset of USJ candidate pairs. ASP is guided by our proposed formal cost model such that the average USJ processing cost is minimized. We have conducted extensive experiments to demonstrate the efficiency and effectiveness of our proposed approaches.
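To make the idea of pruning on the object and sample levels concrete, the following is a minimal sketch and not the paper's actual method, since the abstract does not give the USJ definition or the pruning rules. It assumes each uncertain object is a list of equally weighted sample points, and that a pair joins when the fraction of sample pairs within distance `eps` is at least `alpha` (with `alpha` in (0, 1]). All function names here are hypothetical.

```python
import math
from itertools import product

def mbr(samples):
    """Axis-aligned bounding box (lower corner, upper corner) of a sample set."""
    dims = range(len(samples[0]))
    lo = [min(s[d] for s in samples) for d in dims]
    hi = [max(s[d] for s in samples) for d in dims]
    return lo, hi

def min_box_dist(a, b):
    """Smallest possible Euclidean distance between two boxes."""
    (alo, ahi), (blo, bhi) = a, b
    gaps = [max(bl - ah, al - bh, 0.0)
            for al, ah, bl, bh in zip(alo, ahi, blo, bhi)]
    return math.sqrt(sum(g * g for g in gaps))

def joins(x, y, eps, alpha):
    """Assumed predicate: Pr{dist(X, Y) <= eps} >= alpha over sample pairs."""
    # Object-level pruning: if even the closest points of the two boxes are
    # farther apart than eps, no sample pair can match.
    if min_box_dist(mbr(x), mbr(y)) > eps:
        return False
    total = len(x) * len(y)
    needed = alpha * total      # number of matching sample pairs required
    hits, remaining = 0, total
    for p, q in product(x, y):
        remaining -= 1
        if math.dist(p, q) <= eps:
            hits += 1
            if hits >= needed:  # threshold already reached: accept early
                return True
        # Sample-level pruning: even if every remaining pair matched,
        # the threshold could no longer be reached.
        if hits + remaining < needed:
            return False
    return hits >= needed

x = [(0.0, 0.0), (1.0, 0.0)]
y = [(0.0, 1.0), (5.0, 5.0)]
print(joins(x, y, eps=1.5, alpha=0.5))
print(joins(x, y, eps=1.5, alpha=0.75))
```

In this toy input, two of the four sample pairs lie within `eps`, so the pair qualifies at `alpha=0.5` but not at `alpha=0.75`. A streaming version would additionally have to maintain such results incrementally as objects enter and leave the window, which this sketch does not attempt.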

INDEX TERMS

Join on uncertain data streams, adaptive superset prejoin.

CITATION

Xiang Lian, Lei Chen, "Similarity Join Processing on Uncertain Data Streams," *IEEE Transactions on Knowledge & Data Engineering*, vol. 23, no. 11, pp. 1718-1734, November 2011, doi:10.1109/TKDE.2010.208