Indexing Useful Structural Patterns for XML Query Processing
July 2005 (vol. 17 no. 7)
pp. 997-1009
Queries on semistructured data are hard to process because of the complex nature of the data, and they call for specialized techniques. Existing path-based indexes and query processing algorithms are not efficient for searching complex structures beyond simple paths, even when the queries are highly selective. We introduce the definition of minimal infrequent structures (MIS), which are structures that 1) exist in the data, 2) are not frequent with respect to a support threshold, and 3) have only frequent substructures. By indexing the occurrences of MIS, we can efficiently locate the highly selective substructures of a query, improving search performance significantly. We propose an efficient data mining algorithm that finds the minimal infrequent structures. Their occurrences in the XML data are then indexed by a lightweight data structure and used as a fast filter step in query evaluation. We validate the efficiency and applicability of our methods through experimentation on both synthetic and real data.
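
The three-part MIS definition above can be illustrated with a minimal sketch. It rests on strong simplifying assumptions that are not taken from the paper: structures are restricted to contiguous label paths rather than general tree patterns, support is counted as the number of documents containing a path, and candidates are enumerated by brute force rather than by the paper's mining algorithm. The names `paths_in` and `minimal_infrequent` are hypothetical.

```python
from collections import Counter


def paths_in(doc):
    """Return every contiguous label path occurring in one document.

    Assumption: a document is a list of root-to-leaf paths,
    each given as a tuple of element labels.
    """
    seen = set()
    for path in doc:
        for i in range(len(path)):
            for j in range(i + 1, len(path) + 1):
                seen.add(path[i:j])
    return seen


def minimal_infrequent(docs, min_support):
    """Return the paths that satisfy the three MIS conditions.

    1) the path exists in the data,
    2) its support is below min_support,
    3) all of its proper contiguous subpaths are frequent.
    """
    support = Counter()
    for doc in docs:
        # Each document contributes at most 1 to the support of a path.
        support.update(paths_in(doc))

    result = set()
    for path, count in support.items():
        if count >= min_support:
            continue  # frequent, so it cannot be an MIS
        # Support never increases when a path is extended, so checking the
        # two subpaths that drop one end label covers every shorter subpath.
        subpaths = [path[1:], path[:-1]] if len(path) > 1 else []
        if all(support[sub] >= min_support for sub in subpaths):
            result.add(path)
    return result


# Hypothetical toy data: four documents, support threshold 2.
docs = [
    [("a", "b", "c")],
    [("a", "b")],
    [("x", "b", "c")],
    [("x", "b", "c")],
]
print(minimal_infrequent(docs, 2))  # {('a', 'b', 'c')}
```

In this toy data, `a/b` and `b/c` each occur in at least two documents, but `a/b/c` occurs in only one, so it is the only minimal infrequent path. A query containing `a/b/c` could then be narrowed to that single document before full evaluation, which is the filtering idea the abstract describes.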

Index Terms:
Query processing, XML/XSL/RDF, mining methods and algorithms, document indexing.
Citation:
Wang Lian, Nikos Mamoulis, David Wai-lok Cheung, S.M. Yiu, "Indexing Useful Structural Patterns for XML Query Processing," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 7, pp. 997-1009, July 2005, doi:10.1109/TKDE.2005.110