Efficient Classification across Multiple Database Relations: A CrossMine Approach
June 2006 (vol. 18 no. 6)
pp. 770-783
Relational databases are the most popular repository for structured data and are thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity-relationship links. Multirelational classification is the procedure of building a classifier based on information stored in multiple relations and making predictions with it. Existing approaches from Inductive Logic Programming (recently also known as Relational Mining) have achieved high accuracy in multirelational classification. Unfortunately, most of them suffer from scalability problems with regard to the number of relations in the database. In this paper, we propose a new approach, called CrossMine, which includes a set of novel and powerful methods for multirelational classification: 1) tuple ID propagation, an efficient and flexible method for virtually joining relations, which enables convenient search across different relations; 2) new definitions for predicates and decision-tree nodes, which involve aggregated information to provide essential statistics for classification; and 3) a selective sampling method for improving scalability with regard to the number of tuples. Based on these techniques, we propose two scalable and accurate methods for multirelational classification: CrossMine-Rule, a rule-based method, and CrossMine-Tree, a decision-tree-based method. Our comprehensive experiments on both real and synthetic data sets demonstrate the high scalability and accuracy of the CrossMine approach.
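The core idea of tuple ID propagation, as described in the abstract, is to attach the IDs (and class labels) of the target relation's tuples to the tuples of a joined relation, so that predicates on the second relation can be evaluated without materializing the join. The following is a minimal sketch of that idea on a hypothetical two-relation schema (a Loan target relation linked to an Account relation via `account_id`); all relation and attribute names here are illustrative, not taken from the paper.

```python
from collections import defaultdict

# Toy schema (hypothetical): Loan is the target relation; each loan
# tuple carries a tuple ID and a class label. Account is linked to
# Loan through the foreign key account_id.
loans = [  # (loan_id, account_id, label)
    (1, "A", "+"),
    (2, "A", "-"),
    (3, "B", "+"),
]
accounts = [  # (account_id, district)
    ("A", "east"),
    ("B", "west"),
]

def propagate_ids(target_tuples, linked_tuples):
    """Attach to each tuple of the linked relation the set of target
    tuple IDs (with labels) that join with it, instead of computing
    the physical join."""
    by_key = defaultdict(list)
    for loan_id, account_id, label in target_tuples:
        by_key[account_id].append((loan_id, label))
    # Each account now "virtually" carries the loans that join with it.
    return {acc_id: by_key.get(acc_id, []) for acc_id, _ in linked_tuples}

ids_on_accounts = propagate_ids(loans, accounts)
# Account "A" carries loan IDs 1 (+) and 2 (-); account "B" carries 3 (+).
```

With the IDs in place, the class distribution needed to score a predicate such as `district = "east"` can be read directly off the propagated labels, which is what makes the search among relations cheap.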

[1] A. Appice, M. Ceci, and D. Malerba, “Mining Model Trees: A Multi-Relational Approach,” Proc. 2003 Int'l Conf. Inductive Logic Programming, Sept. 2003.
[2] J.M. Aronis and F.J. Provost, “Increasing the Efficiency of Data Mining Algorithms with Breadth-First Marker Propagation,” Proc. 1997 Int'l Conf. Knowledge Discovery and Data Mining, 1997.
[3] H. Blockeel, L. De Raedt, and J. Ramon, “Top-Down Induction of Logical Decision Trees,” Proc. 1998 Int'l Conf. Machine Learning (ICML '98), Aug. 1998.
[4] H. Blockeel, L. De Raedt, N. Jacobs, and B. Demoen, “Scaling Up Inductive Logic Programming by Learning from Interpretations,” Data Mining and Knowledge Discovery, vol. 3, no. 1, pp. 59-93, 1999.
[5] H. Blockeel, L. Dehaspe, B. Demoen, G. Janssens, J. Ramon, and H. Vandecasteele, “Improving the Efficiency of Inductive Logic Programming through the Use of Query Packs,” J. Artificial Intelligence Research, vol. 16, pp. 135-166, 2002.
[6] C.J.C. Burges, “A Tutorial on Support Vector Machines for Pattern Recognition,” Data Mining and Knowledge Discovery, vol. 2, pp. 121-168, 1998.
[7] P. Clark and R. Boswell, “Rule Induction with CN2: Some Recent Improvements,” Proc. 1991 European Working Session on Learning (EWSL '91), Mar. 1991.
[8] H. Garcia-Molina, J.D. Ullman, and J. Widom, Database Systems: The Complete Book. Prentice Hall, 2002.
[9] J. Gehrke, R. Ramakrishnan, and V. Ganti, “Rainforest: A Framework for Fast Decision Tree Construction of Large Data Sets,” Proc. 1998 Int'l Conf. Very Large Data Bases (VLDB '98), Aug. 1998.
[10] N. Lavrac and S. Dzeroski, Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
[11] H. Liu, H. Lu, and J. Yao, “Identifying Relevant Databases for Multidatabase Mining,” Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining, 1998.
[12] T.M. Mitchell, Machine Learning. McGraw Hill, 1997.
[13] S. Muggleton, Inductive Logic Programming. New York: Academic Press, 1992.
[14] S. Muggleton, “Inverse Entailment and Progol,” New Generation Computing, special issue on inductive logic programming, 1995.
[15] S. Muggleton and C. Feng, “Efficient Induction of Logic Programs,” Proc. 1990 Conf. Algorithmic Learning Theory, 1990.
[16] J. Neville, D. Jensen, L. Friedland, and M. Hay, “Learning Relational Probability Trees,” Proc. 2003 Int'l Conf. Knowledge Discovery and Data Mining, 2003.
[17] A. Popescul, L. Ungar, S. Lawrence, and M. Pennock, “Towards Structural Logistic Regression: Combining Relational and Statistical Learning,” Proc. Multi-Relational Data Mining Workshop, 2002.
[18] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[19] J.R. Quinlan and R.M. Cameron-Jones, “FOIL: A Midterm Report,” Proc. 1993 European Conf. Machine Learning, 1993.
[20] B. Taskar, E. Segal, and D. Koller, “Probabilistic Classification and Clustering in Relational Data,” Proc. 2001 Int'l Joint Conf. Artificial Intelligence, 2001.
[21] X. Wu and S. Zhang, “Synthesizing High-Frequency Rules from Different Data Sources,” IEEE Trans. Knowledge and Data Eng., vol. 15, no. 2, pp. 353-367, 2003.
[22] S. Zhang, X. Wu, and C. Zhang, “Multi-Database Mining,” IEEE Computational Intelligence Bull., vol. 2, no. 1, pp. 5-13, 2003.

Index Terms:
Data mining, classification, relational databases.
Xiaoxin Yin, Jiawei Han, Jiong Yang, Philip S. Yu, "Efficient Classification across Multiple Database Relations: A CrossMine Approach," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 6, pp. 770-783, June 2006, doi:10.1109/TKDE.2006.94