Issue No.11 - November (2009 vol.21)

pp: 1544-1558

Xiang Lian, Hong Kong University of Science and Technology, Hong Kong

Lei Chen, Hong Kong University of Science and Technology, Hong Kong

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2009.27

ABSTRACT

Similarity join (SJ) in time-series databases has a wide spectrum of applications such as data cleaning and mining. Specifically, an SJ query retrieves all pairs of (sub)sequences from two time-series databases that $\varepsilon$-match with each other, where $\varepsilon$ is the matching threshold. Previous work on this problem usually considers static time-series databases, where queries are performed either on disk-based multidimensional indexes built on static data or by nested loop join (NLJ) without indexes. SJ over multiple stream time series, which continuously outputs pairs of similar subsequences from stream time series, strongly requires low memory consumption, low processing cost, and query procedures that are themselves adaptive to time-varying stream data. These requirements invalidate the existing approaches in static databases. In this paper, we propose an efficient and effective approach to perform SJ among multiple stream time series incrementally. In particular, we present a novel method, Adaptive Radius-based Search (ARES), which can answer the similarity search without false dismissals and is seamlessly integrated into SJ processing. Most importantly, we provide a formal cost model for ARES, based on which ARES can be adaptive to data characteristics, achieving the minimum number of refined candidate pairs, and thus, suitable for stream processing. Furthermore, in light of the cost model, we utilize space-efficient synopses that are constructed for stream time series to further reduce the candidate set. Extensive experiments demonstrate the efficiency and effectiveness of our proposed approach.
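
To make the query definition concrete, the following is a minimal sketch of the index-free nested loop join (NLJ) baseline mentioned above, not of ARES itself. It assumes fixed-length sliding windows as the subsequences and Euclidean distance as the matching function; the abstract does not fix either choice, and the names `sliding_windows`, `euclidean`, and `nested_loop_similarity_join` are hypothetical.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def sliding_windows(series: Sequence[float], w: int) -> Iterator[Tuple[int, Sequence[float]]]:
    """Yield (start_offset, window) for every length-w subsequence."""
    for start in range(len(series) - w + 1):
        yield start, series[start:start + w]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance; an assumed choice of matching function."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nested_loop_similarity_join(
    r: Sequence[float], s: Sequence[float], w: int, eps: float
) -> List[Tuple[int, int]]:
    """Return offset pairs (i, j) whose length-w windows are within eps.

    Every window of r is compared with every window of s, so the cost is
    quadratic in the number of windows. This is the naive baseline only.
    """
    matches = []
    for i, window_r in sliding_windows(r, w):
        for j, window_s in sliding_windows(s, w):
            if euclidean(window_r, window_s) <= eps:
                matches.append((i, j))
    return matches


# Two short illustrative series (made-up values).
r = [1.0, 2.0, 3.0, 4.0]
s = [1.0, 2.0, 3.5, 9.0]
pairs = nested_loop_similarity_join(r, s, w=2, eps=0.6)
print(pairs)
```

Because every pair of windows is compared and the windows must be held in memory, a join of this kind scales poorly on unbounded streams, which is the cost the stream setting is trying to avoid.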

INDEX TERMS

Stream time series, ARES, similarity join, synopsis.

CITATION

Xiang Lian, Lei Chen, "Efficient Similarity Join over Multiple Stream Time Series",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 21, no. 11, pp. 1544-1558, November 2009, doi:10.1109/TKDE.2009.27