Efficient Correlation Search from Graph Databases
December 2008 (vol. 20 no. 12)
pp. 1601-1615
Yiping Ke, The Hong Kong University of Science and Technology, Hong Kong
James Cheng, The Hong Kong University of Science and Technology, Hong Kong
Wilfred Ng, The Hong Kong University of Science and Technology, Hong Kong
Correlation mining has gained great success in many application domains for its ability to capture the underlying dependency between objects. However, research on correlation mining from graph databases is still lacking despite the proliferation of graph data in recent years. We propose a new problem of correlation mining from graph databases, called Correlated Graph Search (CGS). CGS adopts Pearson's correlation coefficient to take into account the occurrence distributions of graphs. However, the problem poses significant challenges, since every subgraph of a graph in the database is a candidate but the number of subgraphs is exponential. We derive two necessary conditions that set bounds on the occurrence probability of a candidate in the database. With this result, we devise an efficient algorithm that mines the candidate set from a much smaller projected database and thus a significantly smaller set of candidates is obtained. Three heuristic rules are further developed to refine the candidate set. We also make use of the bounds to directly answer high-support queries without mining the candidates. Experimental results justify the efficiency of our algorithm. Finally, we generalize the CGS problem and show that our algorithm provides a general solution to most of the existing correlation measures.
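As an illustration of the measure named in the abstract, the following is a minimal sketch (not the authors' implementation) of Pearson's correlation coefficient applied to the occurrence of two graphs in a database. It assumes each graph's occurrences are given as a set of database graph IDs, so that subgraph-isomorphism testing is already done; the function names and this representation are hypothetical. The helper `support_bounds` shows one pair of bounds on a candidate's support that follows from the definition together with the fact that joint support cannot exceed either individual support; the exact conditions derived in the paper may be stated differently.

```python
from math import sqrt


def pearson_phi(supp_a: float, supp_b: float, supp_ab: float) -> float:
    """Pearson's correlation coefficient for two binary occurrence variables.

    supp_a, supp_b: fraction of database graphs containing a, resp. b.
    supp_ab: fraction of database graphs containing both.
    """
    denom = sqrt(supp_a * supp_b * (1 - supp_a) * (1 - supp_b))
    if denom == 0:
        # A graph that occurs in no or in every database graph has zero variance.
        raise ValueError("correlation is undefined when a support is 0 or 1")
    return (supp_ab - supp_a * supp_b) / denom


def phi_from_occurrences(occ_a: set, occ_b: set, db_size: int) -> float:
    """Compute the coefficient from sets of IDs of database graphs (assumed input)."""
    return pearson_phi(
        len(occ_a) / db_size,
        len(occ_b) / db_size,
        len(occ_a & occ_b) / db_size,
    )


def support_bounds(supp_q: float, theta: float) -> tuple:
    """Necessary range for supp(g) if phi(q, g) >= theta, with 0 < theta <= 1.

    Derived from phi's definition and supp(q, g) <= min(supp(q), supp(g)).
    """
    lower = supp_q / (supp_q + (1 - supp_q) / theta**2)
    upper = supp_q / (supp_q + (1 - supp_q) * theta**2)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical toy database of 10 graphs, identified by IDs 0..9.
    query_occ = {0, 1, 2, 3}
    cand_occ = {0, 1, 2, 5}
    print(phi_from_occurrences(query_occ, cand_occ, 10))
    print(support_bounds(0.4, 0.5))
```

A candidate whose support falls outside the range returned by `support_bounds` cannot reach the threshold `theta`, which is the kind of pruning the abstract describes when it restricts mining to a smaller projected database.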


Index Terms:
Data mining, Mining methods and algorithms
Citation:
Yiping Ke, James Cheng, Wilfred Ng, "Efficient Correlation Search from Graph Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 12, pp. 1601-1615, Dec. 2008, doi:10.1109/TKDE.2008.86