17th International Conference on Data Engineering (ICDE'01)
An Automated Change-Detection Algorithm for HTML Documents Based on Semantic Hierarchies
Heidelberg, Germany
April 2-6, 2001
ISBN: 0-7695-1001-9
Abstract: Data at many Web sites are changing rapidly, and a significant amount of these data are presented in HTML documents that consist of markup and data contents. Although XML is becoming more popular for data exchange, the data contained in XML documents are by and large presented in HTML format using XSL(T). Since HTML was designed to "display" data from the human perspective, it is not trivial for a machine to detect (hierarchical) changes of data in an HTML document. In this paper, we propose a heuristic algorithm, called SCD, to automatically detect semantic changes of hierarchical data contents in any two HTML documents. Semantic changes differ from syntactic changes, since the latter refer to changes of data contents with respect to markup structures according to the HTML grammar. SCD requires neither preprocessing nor prior knowledge of the internal structure of the source documents. The time complexity of SCD is $O((|X| \times |Y|) \log(|X| \times |Y|))$, where $|X|$ and $|Y|$ are the numbers of unique branches in the syntactic hierarchies of the two given HTML documents, respectively.
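To make the stated complexity bound concrete, the following is a minimal illustrative sketch, not the SCD algorithm itself, whose details the abstract does not give. It assumes that a "branch" is a root-to-text tag path paired with its text, and that branches are matched greedily after scoring and sorting all $|X| \times |Y|$ pairs, which is one way a bound of the form $O((|X| \times |Y|) \log(|X| \times |Y|))$ comparisons can arise. The names `BranchCollector`, `unique_branches`, `similarity`, and `match_branches`, the equal weighting of path and text, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Tags that never take a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class BranchCollector(HTMLParser):
    """Collects unique (tag path, text) branches; a hypothetical stand-in
    for the paper's syntactic hierarchy."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.branches = set()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.branches.add(("/".join(self.stack), text))


def unique_branches(html):
    collector = BranchCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.branches)


def similarity(a, b):
    # Assumed scoring: equal weight on tag-path and text similarity.
    path_score = SequenceMatcher(None, a[0], b[0]).ratio()
    text_score = SequenceMatcher(None, a[1], b[1]).ratio()
    return 0.5 * path_score + 0.5 * text_score


def match_branches(old_html, new_html, threshold=0.5):
    xs, ys = unique_branches(old_html), unique_branches(new_html)
    # Score every pair, then sort: |X|*|Y| items, hence the log factor.
    scored = sorted(
        ((similarity(x, y), i, j)
         for i, x in enumerate(xs) for j, y in enumerate(ys)),
        reverse=True,
    )
    used_x, used_y, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break
        if i not in used_x and j not in used_y:
            used_x.add(i)
            used_y.add(j)
            pairs.append((xs[i], ys[j]))
    changed = [(x, y) for x, y in pairs if x != y]
    deleted = [x for i, x in enumerate(xs) if i not in used_x]
    inserted = [y for j, y in enumerate(ys) if j not in used_y]
    return changed, deleted, inserted
```

Calling `match_branches(old_html, new_html)` on two versions of a page returns three lists: matched branches whose content differs, branches present only in the old version, and branches present only in the new version. The sort over all scored pairs dominates the number of comparisons in this sketch; the per-pair string similarity adds a cost that the bound in the abstract does not account for.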