Issue No.08 - August (2009 vol.21)

pp: 1118-1132

Veronica Lucia Policicchio, University of Calabria, Rende

Pasquale Rullo, University of Calabria, Rende

Salvatore Iiritano, Exeura S.r.l., Rende

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.206

ABSTRACT

This paper describes Olex, a novel method for the automatic induction of rule-based text classifiers. Olex supports a hypothesis language of the form "if $T_1$ or $\cdots$ or $T_n$ occurs in document $d$, and none of $T_{n+1}, \ldots, T_{n+m}$ occurs in $d$, then classify $d$ under category $c$," where each $T_i$ is a conjunction of terms. The proposed method is simple and elegant. Despite this, the results of a systematic experimentation performed on the Reuters-21578, the Ohsumed, and the ODP data collections show that Olex provides classifiers that are accurate, compact, and comprehensible. A comparative analysis conducted against some of the most well-known learning algorithms (namely, Naive Bayes, Ripper, C4.5, SVM, and Linear Logistic Regression) demonstrates that it is more than competitive in terms of both predictive accuracy and efficiency.
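To make the hypothesis language concrete, the following is a minimal sketch of how one rule of that form could be evaluated against a document. It illustrates only the rule semantics stated in the abstract, not the Olex induction method itself. The names `Rule`, `conj`, and `matches`, the bag-of-terms document representation, and the example terms are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple

# A conjunction T_i is modelled as a set of terms that must all occur in d.
Conjunction = FrozenSet[str]


def conj(*terms: str) -> Conjunction:
    """Build a conjunction of terms (hypothetical helper)."""
    return frozenset(terms)


@dataclass(frozen=True)
class Rule:
    """One rule: classify d under `category` if some positive conjunction
    occurs in d and no negative conjunction occurs in d."""
    category: str
    positive: Tuple[Conjunction, ...]  # T_1 ... T_n
    negative: Tuple[Conjunction, ...]  # T_{n+1} ... T_{n+m}

    def matches(self, document_terms: Iterable[str]) -> bool:
        terms = set(document_terms)
        # A conjunction "occurs" in d when all of its terms are present.
        has_positive = any(t <= terms for t in self.positive)
        has_negative = any(t <= terms for t in self.negative)
        return has_positive and not has_negative


# Illustrative rule with made-up terms (not taken from the paper).
rule = Rule(
    category="grain",
    positive=(conj("wheat"), conj("corn", "export")),
    negative=(conj("popcorn"),),
)

if __name__ == "__main__":
    for text in ("wheat prices rise", "corn export falls", "wheat popcorn sale"):
        print(text, "->", rule.matches(text.split()))
```

Representing each conjunction as a set keeps the check to a subset test, and the readability of such a rule is what the abstract refers to when it calls the classifiers compact and comprehensible.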

INDEX TERMS

Data mining; text mining; clustering, classification, and association rules; mining methods and algorithms.

CITATION

Veronica Lucia Policicchio, Pasquale Rullo, Salvatore Iiritano, "Olex: Effective Rule Learning for Text Categorization", *IEEE Transactions on Knowledge & Data Engineering*, vol. 21, no. 8, pp. 1118-1132, August 2009, doi:10.1109/TKDE.2008.206