Issue No.02 - February (2012 vol.24)
pp: 197-208
Zhiwei Lin , University of Ulster at Jordanstown, United Kingdom
Hui Wang , University of Ulster at Jordanstown, United Kingdom
Sally McClean , University of Ulster at Coleraine, United Kingdom
ABSTRACT
The tree is one of the most common and well-studied data structures in computer science. Measuring the similarity of trees is key to analyzing tree-structured data. However, measuring tree similarity is not trivial because of the inherent complexity of trees and the resulting large search space. The tree kernel, a state-of-the-art similarity measure for trees, represents trees as vectors in a feature space and measures similarity in that space. When different features are used, different algorithms are required. Tree edit distance is another widely used similarity measure for trees. It measures similarity through the edit operations needed to transform one tree into another. Without restrictions on the edit operations, the computational cost is too high for large volumes of data. To improve efficiency, approximations of tree edit distance have been introduced, but their effectiveness can be compromised. In this paper, a novel approach to measuring tree similarity is presented. Trees are represented as multidimensional sequences, and their similarity is measured on the basis of these sequence representations. Multidimensional sequences have sequential dimensions and spatial dimensions. We measure the sequential similarity by the all common subsequences similarity measure or the longest common subsequence measure, and we measure the spatial similarity by dynamic time warping. We then combine them to give a measure of tree similarity. A brute-force algorithm for calculating the similarity has a high computational cost. In the spirit of dynamic programming, two efficient algorithms with quadratic time complexity are designed for calculating the similarity.
The new measurements are evaluated in terms of classification accuracy in two popular classifiers (k-nearest neighbor and support vector machine) and in terms of search effectiveness and efficiency in k-nearest neighbor similarity search, using three different data sets from natural language processing and information retrieval. Experimental results show that the new measurements outperform the benchmark measures consistently and significantly.
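As a rough illustration of the two building blocks named in the abstract, the sketch below shows the textbook dynamic programming forms of the longest common subsequence length and of dynamic time warping, each running in quadratic time. It is not the paper's algorithm: the abstract does not say how trees are mapped to multidimensional sequences or how the sequential and spatial scores are combined, so the function names `lcs_length` and `dtw_distance`, the use of plain strings and number lists as inputs, and the absolute difference as the local cost are all assumptions made for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    Standard O(len(a) * len(b)) dynamic program (illustrative only).
    """
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


def dtw_distance(x, y):
    """Dynamic time warping distance between numeric sequences x and y.

    Uses |x_i - y_j| as the local cost (an assumption for this sketch)
    and the standard O(len(x) * len(y)) recurrence.
    """
    m, n = len(x), len(y)
    inf = float("inf")
    # cost[i][j] holds the cheapest warping cost aligning x[:i] with y[:j].
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y repeats
                cost[i][j - 1],      # y advances, x repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[m][n]


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Both functions fill an (m+1) by (n+1) table once, which is where the quadratic cost comes from. A similarity score for trees would additionally need a sequence encoding of each tree and a rule for combining the two quantities, neither of which is given in this abstract.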
INDEX TERMS
All common subsequences, approximate tree edit distance, dynamic time warping, the longest common subsequence, tree edit distance, tree kernel, tree similarity.
CITATION
Zhiwei Lin, Hui Wang, Sally McClean, "A Multidimensional Sequence Approach to Measuring Tree Similarity", IEEE Transactions on Knowledge & Data Engineering, vol.24, no. 2, pp. 197-208, February 2012, doi:10.1109/TKDE.2010.239