This Article 
   
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach
November 2004 (vol. 16 no. 11)
pp. 1424-1440
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [1] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. Based on an initial study of the pattern growth-based sequential pattern mining, FreeSpan [8], we propose a more efficient method, called PSP, which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection technique is developed in PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the a priori-based algorithm GSP, FreeSpan, and SPADE [29] (a sequential pattern mining algorithm that adopts vertical data format), and PrefixSpan integrated with pseudoprojection is the fastest among all the tested algorithms. Furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures.

[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules Proc. 1994 Int'l Conf. Very Large Data Bases (VLDB '94), pp. 487-499, Sept. 1994.
[2] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 1995 Int'l Conf. Data Eng., pp. 3-14, Mar. 1995.
[3] R.J. Bayardo, Efficiently Mining Long Patterns from Databases Proc. 1998 ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD '98), pp. 85-93, June 1998.
[4] R.J. Bayardo, R. Agrawal, and D. Gunopulos, Constraint-Based Rule Mining on Large, Dense Data Sets Proc. 1999 Int'l Conf. Data Eng. (ICDE '99), pp. 188-197, Apr. 1999.
[5] C. Bettini, X.S. Wang, and S. Jajodia, Mining Temporal Relationships with Multiple Granularities in Time Sequences Data Eng. Bull., vol. 21, pp. 32-38, 1998.
[6] S. Guha, R. Rastogi, and K. Shim, ROCK: A Robust Clustering Algorithm For Categorical Attributes Proc. 15th Int'l Conf. Data Eng., pp. 512-521, 1999.
[7] J. Han, G. Dong, and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series Database Proc. 15th Int'l Conf. Data Eng., pp. 106-115, Mar. 1999.
[8] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining Proc. 2000 ACM SIGKDD Int'l Conf. Knowledge Discovery in Databases (KDD '00), pp. 355-359, Aug. 2000.
[9] J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns without Candidate Generation Proc. 2000 ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 1-12, May 2000.
[10] R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng, KDD-Cup 2000 Organizers' Report: Peeling the Onion Proc. SIGKDD Explorations, vol. 2, pp. 86-98, 2000.
[11] M. Kuramochi and G. Karypis, Frequent Subgraph Discovery Proc. 2001 IEEE Int'l Conf. Data Mining (ICDM), Nov. 2001.
[12] H. Lu, J. Han, and L. Feng, Stock Movement and n-Dimensional Inter-Transaction Association Rules Proc. 1998 SIGMOD Workshop Research Issues on Data Mining and Knowledge Discovery (DMKD '98), vol. 12, pp. 1-7, June 1998.
[13] H. Mannila, H Toivonen, and A.I. Verkamo, Discovery of Frequent Episodes in Event Sequences Data Mining and Knowledge Discovery, vol. 1, pp. 259-289, 1997.
[14] F. Masseglia, F. Cathala, and P. Poncelet, The PSP Approach for Mining Sequential Patterns Proc. 1998 European Symp. Principle of Data Mining and Knowledge Discovery (PKDD '98), pp. 176-184, Sept. 1998.
[15] R. Ng, L.V.S. Lakshmanan, J. Han, and A. Pang, Exploratory Mining and Pruning Optimizations of Constrained Associations Rules Proc. 1998 ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD '98), pp. 13-24, June 1998.
[16] B. Ozden, S. Ramaswamy, and A. Silberschatz, Cyclic Association Rules Proc. 14th Int'l Conf. Data Eng., Apr. 1998.
[17] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, Discovering Frequent Closed Itemsets for Association Rules Proc. Seventh Int'l Conf. Database Theory (ICDT '99), pp. 398-416, Jan. 1999.
[18] J. Pei, J. Han, and L.V.S. Lakshmanan, Mining Frequent Itemsets with Convertible Constraints Proc. Int'l Conf. Data Eng., 2001.
[19] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth Proc. 2001 Int'l Conf. Data Eng. (ICDE '01), pp. 215-226, 2001.
[20] J. Pei, J. Han, and W. Wang, Constraint-Based Sequential Pattern Mining in Large Databases Proc. 2002 Int'l Conf. Information and Knowledge Management (CIKM '02), pp. 18-25, Nov. 2002.
[21] H. Pinto, J. Han, J. Pei, K. Wang, Q. Chen, and U. Dayal, Multi-Dimensional Sequential Pattern Mining Proc. 2001 Int'l Conf. Information and Knowledge Management (CIKM '01), pp. 81-88, Nov. 2001.
[22] S. Ramaswamy, S. Mahajan, and A. Silberschatz, On the Discovery of Interesting Patterns in Association Rules Proc. 1998 Int'l Conf. Very Large Data Bases (VLDB '98), pp. 368-379, Aug. 1998.
[23] R. Srikant and R. Agrawal, Mining Sequential Patterns: Generalizations and Performance Improvements Proc. Fifth Int'l Conf. Extending Database Technology (EDBT '96), pp. 3-17, Mar. 1996.
[24] J. Wang, G. Chirn, T. Marr, B. Shapiro, D. Shasha, and K. Zhang, Combinatiorial Pattern Discovery for Scientific Data: Some Preliminary Results Proc. 1994 ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD '94), pp. 115-125, May 1994.
[25] X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining Proc. 2002 IEEE Int'l Conf. Data Mining (ICDM), 2002.
[26] X. Yan and J. Han, CloseGraph: Mining Closed Frequent Graph Patterns Proc. 2003 ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '03), Aug. 2003.
[27] X. Yan, J. Han, and R. Afshar, CloSpan: Mining Closed Sequential Patterns in Large Datasets Proc. 2003 SIAM Int'l Conf. Data Mining (SDM '03), pp. 166-177, May 2003.
[28] J. Yang, W. Wang, P.S. Yu, and J. Han, Mining Long Sequential Patterns in a Noisy Environment Proc. 2002 ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD '02), pp. 406-417, June 2002.
[29] M. Zaki, SPADE: An Efficient Algorithm for Mining Frequent Sequences Machine Learning, vol. 40, pp. 31-60, 2001.
[30] M.J. Zaki, Efficient Enumeration of Frequent Sequences Proc. Seventh Int'l Conf. Information and Knowledge Management (CIKM '98), pp. 68-75, Nov. 1998.
[31] M.J. Zaki, Efficiently Mining Frequent Trees in a Forest Proc. 2002 ACM SIGKDD Int'l Conf. Knowledge Discovery in Databases (KDD '02), pp. 71-80, July 2002.

Index Terms:
Data mining algorithm, sequential pattern, frequent pattern, transaction database, sequence database, scalability, performance analysis.
Citation:
Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Jianyong Wang, Helen Pinto, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu, "Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1424-1440, Nov. 2004, doi:10.1109/TKDE.2004.77
Usage of this product signifies your acceptance of the Terms of Use.