Efficient Top-k Approximate Subtree Matching in Small Memory
August 2011 (vol. 23 no. 8)
pp. 1123-1137
Nikolaus Augsten, Free University of Bozen-Bolzano, Bozen
Denilson Barbosa, University of Alberta, Edmonton
Michael M. Böhlen, University of Zurich, Zurich
Themis Palpanas, University of Trento, Trento
We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree within a large document tree using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity and thus do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper bound for the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results.
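
The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' prefix ring buffer: it assumes the document arrives as a postorder stream of (label, subtree size) pairs, takes the size threshold as a given parameter (the hypothetical name `tau`), and uses a plain deque in place of a fixed-size ring buffer. The function name `candidate_subtrees` is likewise an illustrative assumption. The sketch emits every maximal subtree of at most `tau` nodes in one pass while holding fewer than `tau` buffered nodes between steps.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple


def candidate_subtrees(
    postorder: Iterable[Tuple[str, int]], tau: int
) -> Iterator[List[str]]:
    """Yield the maximal subtrees with at most `tau` nodes, as postorder label lists.

    `postorder` is a stream of (label, subtree_size) pairs in postorder.
    This is an illustrative stand-in, not the paper's prefix ring buffer.
    """
    # Each entry is (start_position, labels) for a subtree not yet merged or emitted.
    pending = deque()
    for pos, (label, size) in enumerate(postorder, start=1):
        if size > tau:
            # Any ancestor of a pending subtree that is still to come also
            # contains this node, so it is too large: all pending are maximal.
            while pending:
                yield pending.popleft()[1]
            continue
        # In postorder, a node at `pos` with `size` nodes covers pos-size+1 .. pos.
        start = pos - size + 1
        parts = []
        while pending and pending[-1][0] >= start:
            parts.append(pending.pop()[1])  # child subtrees, right to left
        labels = [l for part in reversed(parts) for l in part]
        labels.append(label)
        pending.append((start, labels))
        # A later node covering a subtree that starts at or before pos - tau + 1
        # would have more than tau nodes, so such subtrees are final.
        while pending and pending[0][0] <= pos - tau + 1:
            yield pending.popleft()[1]
    while pending:  # end of stream: whatever remains is maximal
        yield pending.popleft()[1]


# Example document r(a, b(c, d), e) in postorder with subtree sizes.
doc = [("a", 1), ("c", 1), ("d", 1), ("b", 3), ("e", 1), ("r", 6)]
print(list(candidate_subtrees(doc, tau=3)))
# [['a'], ['c', 'd', 'b'], ['e']]
```

In a full TASM pipeline, each emitted candidate would then be compared with the query by tree edit distance and ranked in a size-k structure; that step is omitted here.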

Index Terms:
Approximate subtree matching, tree edit distance, top-k queries, XML, subtree pruning, similarity search.
Nikolaus Augsten, Denilson Barbosa, Michael M. Böhlen, Themis Palpanas, "Efficient Top-k Approximate Subtree Matching in Small Memory," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 8, pp. 1123-1137, Aug. 2011, doi:10.1109/TKDE.2010.245