An Efficient Algorithm to Compute Differences between Structured Documents
August 2004 (vol. 16 no. 8)
pp. 965-979
Kyong-Ho Lee, IEEE Computer Society
Yoon-Chul Choy, IEEE Computer Society
Sung-Bae Cho, IEEE Computer Society

Abstract: SGML and XML are having a profound impact on data modeling and processing. This paper presents an efficient algorithm to compute the differences between old and new versions of an SGML/XML document. The difference between the two versions can be regarded as an edit script that transforms one document tree into the other. The proposed algorithm is a hybrid of bottom-up and top-down methods: the matching relationships between nodes in the two versions are produced bottom-up, and a top-down breadth-first search then computes the edit script. Matching is faster because the algorithm does not need to investigate possible matchings for every node. Furthermore, it can detect structurally meaningful changes, such as moving or copying a subtree, as well as simple changes to a node itself, such as insertion, deletion, and update.
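The two-phase idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm, whose matching criteria are not given in the abstract. It is a simplified illustration under stated assumptions: the `Node` class, the `signature` and `diff` functions, and the path notation are all hypothetical; subtrees match only when they are identical; children are paired by position; and the two roots are assumed to correspond. Move and copy detection is omitted.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical document-tree node: an element label, text, and children."""
    label: str
    text: str = ""
    children: list = field(default_factory=list)


def signature(node, table):
    """Bottom-up phase: compute a signature for every subtree in post-order.

    Two subtrees get equal signatures only if they are identical in label,
    text, and child structure (an assumption of this sketch).
    """
    sig = (node.label, node.text,
           tuple(signature(child, table) for child in node.children))
    table[id(node)] = sig
    return sig


def diff(old, new):
    """Top-down phase: breadth-first walk that emits a list of edit operations."""
    old_sigs, new_sigs = {}, {}
    signature(old, old_sigs)
    signature(new, new_sigs)

    script = []
    queue = deque([(old, new, "/" + old.label)])
    while queue:
        o, n, path = queue.popleft()
        if old_sigs[id(o)] == new_sigs[id(n)]:
            continue  # identical subtrees: nothing below needs inspecting
        if o.text != n.text:
            script.append(("update", path, n.text))
        # Simplification: pair children by position, not by best match.
        for i in range(max(len(o.children), len(n.children))):
            oc = o.children[i] if i < len(o.children) else None
            nc = n.children[i] if i < len(n.children) else None
            child_path = f"{path}/{i}"
            if oc is None:
                script.append(("insert", child_path, nc.label))
            elif nc is None:
                script.append(("delete", child_path, oc.label))
            elif oc.label != nc.label:
                script.append(("delete", child_path, oc.label))
                script.append(("insert", child_path, nc.label))
            else:
                queue.append((oc, nc, child_path))
    return script


old_doc = Node("doc", children=[Node("title", "Intro"),
                                Node("para", "Hello"),
                                Node("para", "Bye")])
new_doc = Node("doc", children=[Node("title", "Intro"),
                                Node("para", "Hello, world")])
print(diff(old_doc, new_doc))
```

Skipping subtrees whose signatures already agree is what keeps the top-down pass from visiting every node, which loosely mirrors the abstract's point that not all nodes need to be examined for matchings.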

Index Terms:
Change detection, difference computation, edit script, edit operation, structured document, SGML, XML.
Kyong-Ho Lee, Yoon-Chul Choy, Sung-Bae Cho, "An Efficient Algorithm to Compute Differences between Structured Documents," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 8, pp. 965-979, Aug. 2004, doi:10.1109/TKDE.2004.19