Fast Discovery and the Generalization of Strong Jumping Emerging Patterns for Building Compact and Accurate Classifiers
June 2006 (vol. 18 no. 6)
pp. 721-737
Classification of large data sets is an important data mining problem that has wide applications. Jumping Emerging Patterns (JEPs) are those itemsets whose supports increase abruptly from zero in one data set to nonzero in another data set. In this paper, we propose a fast, accurate, and less complex classifier based on a subset of JEPs, called Strong Jumping Emerging Patterns (SJEPs). The support constraint of SJEP removes potentially less useful JEPs while retaining those with high discriminating power. Previous algorithms based on the manipulation of border [1] as well as consEPMiner [2] cannot directly mine SJEPs. Here, we present a new tree-based algorithm for their efficient discovery. Experimental results show that: 1) the training of our classifier is typically 10 times faster than earlier approaches, 2) our classifier uses far fewer patterns than the JEP-Classifier [3] to achieve a similar (and, often, improved) accuracy, and 3) in many cases, it is superior to other state-of-the-art classification systems such as Naive Bayes, CBA, C4.5, and bagged and boosted versions of C4.5. We argue that SJEPs are high-quality patterns which possess the most differentiating power. As a consequence, they represent sufficient information for the construction of accurate classifiers. In addition, we generalize these patterns by introducing Noise-tolerant Emerging Patterns (NEPs) and Generalized Noise-tolerant Emerging Patterns (GNEPs). Our tree-based algorithms can be easily adapted to discover these variations. We experimentally demonstrate that SJEPs, NEPs, and GNEPs are extremely useful for building effective classifiers that can deal well with noise.
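To make the JEP definition in the abstract concrete, the following is a minimal Python sketch, not the tree-based algorithm of the paper. It checks whether an itemset has zero support in one data set and nonzero support in another, then enumerates patterns by brute force. The function names (`support`, `is_jep`, `strong_jeps`), the toy data, and the reading of "strong" as "meets a minimum support threshold in the target data set and has no proper subset that also qualifies" are assumptions made for illustration only.

```python
from itertools import combinations


def support(itemset, dataset):
    """Fraction of transactions in `dataset` that contain every item of `itemset`."""
    if not dataset:
        return 0.0
    return sum(1 for t in dataset if itemset <= t) / len(dataset)


def is_jep(itemset, background, target):
    """True if support jumps from zero in `background` to nonzero in `target`."""
    return support(itemset, background) == 0 and support(itemset, target) > 0


def strong_jeps(background, target, min_support):
    """Brute-force enumeration (exponential; for illustration only).

    Assumption: a "strong" JEP is a JEP whose support in `target` is at
    least `min_support` and which has no proper subset that also qualifies.
    """
    items = sorted(set().union(*target))
    found = []
    # Enumerate by increasing size so that shorter patterns are found first.
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            # Skip supersets of patterns already found (keeps only minimal ones).
            if any(p < candidate for p in found):
                continue
            if (is_jep(candidate, background, target)
                    and support(candidate, target) >= min_support):
                found.append(candidate)
    return found


# Hypothetical toy data: each transaction is a set of single-letter items.
background = [frozenset("ab"), frozenset("bc"), frozenset("b")]
target = [frozenset("ac"), frozenset("acd"), frozenset("bd"), frozenset("ac")]

patterns = strong_jeps(background, target, min_support=0.5)
print(patterns)
```

Raising `min_support` drops low-support patterns, which is the filtering effect the abstract attributes to the support constraint. The exhaustive search is only practical for a handful of items, which is why an efficient mining algorithm is needed on real data.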

[1] G. Dong and J. Li, “Efficient Mining of Emerging Patterns: Discovering Trends and Differences,” Proc. Fifth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '99), pp. 43-52, Aug. 1999.
[2] X. Zhang, G. Dong, and K. Ramamohanarao, “Exploring Constraints to Efficiently Mine Emerging Patterns from Large High-Dimensional Data Sets,” Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 310-314, Aug. 2000.
[3] J. Li, G. Dong, and K. Ramamohanarao, “Making Use of the Most Expressive Jumping Emerging Patterns for Classification,” Knowledge and Information Systems, vol. 3, no. 2, pp. 131-145, 2001.
[4] T. Mitchell, Machine Learning. McGraw Hill, 1997.
[5] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
[6] U.M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From Data Mining to Knowledge Discovery in Databases,” AI Magazine, vol. 17, pp. 37-54, 1996.
[7] R. Brachman, T. Khabaza, W. Kloesgen, G. Piatetsky-Shapiro, and E. Simoudis, “Mining Business Databases,” Comm. ACM, vol. 39, no. 11, pp. 42-48, 1996.
[8] J. Li, K. Ramamohanarao, and G. Dong, “The Space of Jumping Emerging Patterns and Its Incremental Maintenance Algorithms,” Proc. 17th Int'l Conf. Machine Learning (ICML '00), pp. 551-558, 2000.
[9] J. Li, T. Manoukian, G. Dong, and K. Ramamohanarao, “Incremental Maintenance on the Border of the Space of Emerging Patterns,” Data Mining and Knowledge Discovery, vol. 9, no. 1, pp. 89-116, 2004.
[10] J.R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, Calif.: Morgan Kaufmann, 1993.
[11] B. Liu, W. Hsu, and Y. Ma, “Integrating Classification and Association Rule Mining,” Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining (KDD-98), pp. 80-86, 1998.
[12] J. Bailey, T. Manoukian, and K. Ramamohanarao, “Fast Algorithms for Mining Emerging Patterns,” Proc. Sixth European Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD '02), 2002.
[13] C.L. Blake and C.J. Merz, “UCI Repository of Machine Learning Databases,” 1998, http://www.ics.uci.edu/~mlearn/MLRepository.html.
[14] R. Agrawal, S.P. Ghosh, T. Imielinski, B.R. Iyer, and A.N. Swami, “An Interval Classifier for Database Mining Applications,” Proc. 18th Int'l Conf. Very Large Data Bases, pp. 560-573, 1992.
[15] H. Fan and K. Ramamohanarao, “An Efficient Single-Scan Algorithm for Mining Essential Jumping Emerging Patterns for Classification,” Proc. Sixth Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD '02), pp. 456-462, May 2002.
[16] G. Dong, X. Zhang, L. Wong, and J. Li, “CAEP: Classification by Aggregating Emerging Patterns,” Proc. Second Int'l Conf. Discovery Science (DS '99), pp. 30-42, Dec. 1999.
[17] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[18] U.M. Fayyad and K.B. Irani, “Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning,” Proc. 13th Int'l Joint Conf. Artificial Intelligence (IJCAI), pp. 1022-1029, 1993.
[19] R. Kohavi, G. John, R. Long, D. Manley, and K. Pfleger, “MLC++: A Machine Learning Library in C++,” Tools with Artificial Intelligence, pp. 740-743, 1994.
[20] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann, 1999.
[21] J. Li, H. Liu, J.R. Downing, A.E.-J. Yeoh, and L. Wong, “Simple Rules Underlying Gene Expression Profiles of More than Six Subtypes of Acute Lymphoblastic Leukemia (ALL) Patients,” Bioinformatics, vol. 19, no. 1, pp. 71-78, 2003.
[22] J. Li and L. Wong, “Identifying Good Diagnostic Gene Groups from Gene Expression Profiles Using the Concept of Emerging Patterns,” Bioinformatics, vol. 18, no. 5, pp. 725-734, 2002.
[23] J. Li, H. Liu, S.-K. Ng, and L. Wong, “Discovery of Significant Rules for Classifying Cancer Diagnosis Data,” Bioinformatics, vol. 19, no. 2, pp. 93-102, 2003.
[24] T. Mitchell, “Generalization as Search,” Artificial Intelligence, vol. 18, no. 2, 1982.
[25] C.M. Bishop, Neural Networks for Pattern Recognition. Oxford Univ. Press, 1995.
[26] B. Ripley, Pattern Recognition and Neural Networks. Cambridge Univ. Press, 1996.
[27] A.A. Freitas, Data Mining and Knowledge Discovery with Evolutionary Algorithms. Berlin: Springer-Verlag, 2002.
[28] P. Cheeseman and J. Stutz, “Bayesian Classification (Autoclass): Theory and Results,” Proc. Second Int'l Conf. Knowledge Discovery and Data Mining (KDD '96), pp. 153-180, 1996.
[29] R. Christensen, Log-Linear Models and Logistic Regression. Springer, 1997.
[30] D. Aha, D. Kibler, and M. Albert, “Instance-Based Learning Algorithms,” Machine Learning, vol. 6, pp. 37-66, 1991.
[31] B. Dasarathy, Nearest Neighbor Norms: NN Pattern Classification Techniques. Los Alamitos, Calif.: IEEE CS Press, 1991.
[32] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. New York: Chapman & Hall, 1984.
[33] J.R. Quinlan, “Induction of Decision Trees,” Machine Learning, vol. 1, pp. 81-106, 1986.
[34] RuleQuest, “See5/C5.0,” RuleQuest Research Data Mining Tools, http://www.rulequest.com/, 2000.
[35] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” Proc. 2000 ACM-SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 1-12, May 2000.
[36] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, “H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases,” Proc. 2001 Int'l Conf. Data Mining (ICDM '01), 2001.
[37] C. Yang, U. Fayyad, and P.S. Bradley, “Efficient Discovery of Error-Tolerant Frequent Itemsets in High Dimensions,” Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '01), pp. 194-203, 2001.
[38] M. Seno and G. Karypis, “Lpminer: An Algorithm for Finding Frequent Itemsets Using Length-Decreasing Support Constraint,” Proc. First IEEE Int'l Conf. Data Mining (ICDM '01), pp. 505-512, 2001.
[39] J. Li, X. Zhang, G. Dong, K. Ramamohanarao, and Q. Sun, “Efficient Mining of High Confidence Association Rules without Support Thresholds,” Proc. Third European Conf. Principles of Data Mining and Knowledge Discovery (PKDD '99), pp. 406-411, 1999.

Index Terms:
Data mining, machine learning, emerging patterns, classification, frequent patterns, mining methods and algorithms.
Hongjian Fan, Kotagiri Ramamohanarao, "Fast Discovery and the Generalization of Strong Jumping Emerging Patterns for Building Compact and Accurate Classifiers," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 6, pp. 721-737, June 2006, doi:10.1109/TKDE.2006.95