Discovering Frequent Episodes and Learning Hidden Markov Models: A Formal Connection
November 2005 (vol. 17 no. 11)
pp. 1505-1517
This paper establishes a formal connection between two common but previously unconnected methods for analyzing data streams: discovering frequent episodes in a computer science framework and learning generative models in a statistics framework. We introduce a special class of discrete Hidden Markov Models (HMMs), called Episode Generating HMMs (EGHs), and associate each episode with a unique EGH. We prove that, given any two episodes, the EGH that is more likely to generate a given data sequence is the one associated with the more frequent episode. To establish this relationship, we define a new measure of the frequency of an episode, based on what we call nonoverlapping occurrences of the episode in the data. An efficient algorithm is proposed for counting the frequencies of a set of episodes. Through extensive simulations, we show that our algorithm is both effective and more efficient than current methods for frequent episode discovery. We also show how the association between frequent episodes and EGHs can be exploited to assess the significance of the frequent episodes discovered, and we illustrate empirically how this idea may be used to improve the efficiency of frequent episode discovery.
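
The frequency measure based on nonoverlapping occurrences can be illustrated with a minimal sketch. It assumes a serial episode (an ordered list of event types) and a plain sequence of event types with timestamps omitted. It uses a single greedy automaton that resets each time the episode completes. The function name `count_nonoverlapping` and the sample data are hypothetical. This is not the paper's algorithm, which counts frequencies for a set of candidate episodes.

```python
from typing import Hashable, Iterable, Sequence


def count_nonoverlapping(events: Iterable[Hashable],
                         episode: Sequence[Hashable]) -> int:
    """Greedily count nonoverlapping occurrences of a serial episode.

    Illustrative sketch only: one automaton waits for the episode's
    event types in order and resets once a full occurrence is seen.
    """
    if not episode:
        raise ValueError("episode must contain at least one event type")

    count = 0
    pos = 0  # index of the next event type the automaton is waiting for
    for event in events:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):
                # A full occurrence ended here; start looking for the
                # next one strictly after this event.
                count += 1
                pos = 0
    return count


# Hypothetical data stream of event types.
stream = list("ABCABACBC")
freq_abc = count_nonoverlapping(stream, ["A", "B", "C"])
freq_ca = count_nonoverlapping(stream, ["C", "A"])
```

Under this measure, comparing `freq_abc` with `freq_ca` ranks the two episodes by frequency on the same stream, which is the quantity the abstract relates to the likelihoods of the associated EGHs.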

Index Terms: Temporal data mining, sequential data, frequent episodes, Hidden Markov Models, statistical significance.
Srivatsan Laxman, P.S. Sastry, K.P. Unnikrishnan, "Discovering Frequent Episodes and Learning Hidden Markov Models: A Formal Connection," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1505-1517, Nov. 2005, doi:10.1109/TKDE.2005.181