Issue No.11 - November (2010 vol.22)

pp: 1609-1622

Willis Lang, University of Michigan, Ann Arbor

Michael Morse, University of Michigan, Ann Arbor

Jignesh M. Patel, University of Michigan, Ann Arbor

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2009.201

ABSTRACT

Long time-series data sets are common in many domains, especially scientific domains. Applications in these fields often require comparing trajectories using similarity measures. Existing methods perform well for short time series, but their evaluation cost grows rapidly for longer time series. In this work, we develop a new time-series similarity measure called the Dictionary Compression Score (DCS). We show that this method allows us to accurately and quickly calculate similarity for both short and long time series. We use the well-known Kolmogorov complexity from information theory and the Lempel-Ziv compression framework as a basis for calculating similarity scores. We show that off-the-shelf compressors do not fare well for computing time-series similarity. To address this problem, we develop a novel dictionary-based compression technique to compute time-series similarity. We also develop heuristics to automatically identify suitable parameters for our method, thus removing the parameter-tuning task that other existing methods require. We have extensively compared DCS with existing similarity methods for classification. Our experimental evaluation shows that for long time-series data sets, DCS is accurate, and it is also significantly faster than existing methods.
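As a rough illustration of the compression-based view of similarity that the abstract builds on, the sketch below computes the normalized compression distance (NCD) of Cilibrasi and Vitanyi with an off-the-shelf compressor. This is not the DCS method, whose dictionary construction is not described on this page; it is the kind of off-the-shelf baseline the abstract says does not fare well. The names `discretize`, `ncd`, and `compressed_size`, the `levels` parameter, and the equal-width binning step are assumptions made only for this example.

```python
import math
import random
import zlib


def discretize(series, levels=8):
    """Map real values to byte symbols 0..levels-1 using equal-width bins.

    Hypothetical preprocessing step: a byte-oriented compressor needs symbols.
    """
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant series
    return bytes(min(levels - 1, int((v - lo) / span * levels)) for v in series)


def compressed_size(data: bytes) -> int:
    """Compressed length in bytes, a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Example data: a smooth wave and uniform noise, both reduced to 8 symbols.
rng = random.Random(0)
wave = discretize([math.sin(i / 5.0) for i in range(500)])
noise = discretize([rng.random() for _ in range(500)])

print("wave vs. wave :", ncd(wave, wave))
print("wave vs. noise:", ncd(wave, noise))
```

Smaller values indicate that one sequence helps compress the other, so a series compared with itself should score well below a series compared with unrelated noise.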

INDEX TERMS

Spatial databases and GIS, database management.

CITATION

Willis Lang, Michael Morse, Jignesh M. Patel, "Dictionary-Based Compression for Long Time-Series Similarity", *IEEE Transactions on Knowledge & Data Engineering*, vol. 22, no. 11, pp. 1609-1622, November 2010, doi:10.1109/TKDE.2009.201