Issue No.03 - March (2012 vol.24)
pp: 440-451
Guoren Wang, Northeastern University, Shenyang
Bin Wang, Northeastern University, Shenyang
Xiaochun Yang, Northeastern University, Shenyang
Ge Yu, Northeastern University, Shenyang
ABSTRACT
The graph structure is a very important means to model schemaless data with complicated structures, such as protein-protein interaction networks, chemical compounds, knowledge query inferring systems, and road networks. This paper focuses on the index structure for similarity search on a set of large sparse graphs and proposes an efficient indexing mechanism by introducing the Q-Gram idea. By decomposing graphs into small grams (organized by κ-Adjacent Tree patterns) and pairing up those κ-Adjacent Tree patterns, a lower bound on the edit distance between two graphs can be estimated and used for candidate filtering. Furthermore, we have developed a series of techniques for inverted index construction and online query processing. By building the candidate set for the query graph before the exact edit distance calculation, the number of graphs that need to proceed to exact matching can be greatly reduced. Extensive experiments on real and synthetic data sets have been conducted to show the effectiveness and efficiency of the proposed indexing mechanism.
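The filter-then-verify idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes κ = 1 (each gram is a vertex label plus the sorted labels of its neighbours), ignores edge labels, and uses a simple count threshold chosen for this example instead of the lower bound derived in the paper. The function names and the `max_degree` parameter are hypothetical; `max_degree` is assumed to bound vertex degree in every graph along an edit path, so that one edit operation changes at most `max(2, max_degree + 1)` grams.

```python
from collections import Counter, defaultdict


def adjacent_trees(labels, edges):
    """Multiset of 1-adjacent trees: (root label, sorted neighbour labels)."""
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    return Counter(
        (labels[v], tuple(sorted(labels[w] for w in nbrs[v]))) for v in labels
    )


def build_index(db):
    """Inverted index: pattern -> {graph id: occurrences in that graph}."""
    index = defaultdict(dict)
    for gid, (labels, edges) in db.items():
        for pattern, count in adjacent_trees(labels, edges).items():
            index[pattern][gid] = count
    return index


def candidates(index, db, query, tau, max_degree):
    """Graph ids that survive the count filter for edit-distance threshold tau."""
    q_labels, q_edges = query
    q_grams = adjacent_trees(q_labels, q_edges)

    # Count grams shared with each database graph (multiset intersection).
    shared = Counter()
    for pattern, q_count in q_grams.items():
        for gid, count in index.get(pattern, {}).items():
            shared[gid] += min(q_count, count)

    # Assumption of this sketch: one edit operation changes at most this many grams.
    affected = max(2, max_degree + 1)

    result = []
    for gid, (g_labels, _) in db.items():
        needed = max(len(q_labels), len(g_labels)) - tau * affected
        if shared[gid] >= needed:
            result.append(gid)  # still needs exact edit-distance verification
    return result
```

Only the graphs returned by `candidates` would go on to the expensive exact edit distance computation; graphs that share too few grams with the query are discarded using the inverted index alone.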
INDEX TERMS
Graph indexing, similarity search, κ-adjacent tree.
CITATION
Guoren Wang, Bin Wang, Xiaochun Yang, Ge Yu, "Efficiently Indexing Large Sparse Graphs for Similarity Search", IEEE Transactions on Knowledge & Data Engineering, vol.24, no. 3, pp. 440-451, March 2012, doi:10.1109/TKDE.2010.28