Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2
Layout Based Information Extraction from HTML Documents
Curitiba, Parana, Brazil
September 23-September 26
ISBN: 0-7695-2822-8
We propose a method for extracting information from HTML documents based on modelling the visual information in the document. A page segmentation algorithm first detects the document layout; the extraction process then analyses the mutual positions of the detected blocks and their visual features. This approach is more robust than traditional DOM-based methods, and it opens new possibilities for specifying the extraction task.
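
To make the idea of extraction from mutual block positions concrete, the following is a minimal sketch rather than the method of the paper itself. It assumes that a segmentation step has already produced rectangular blocks with pixel coordinates and basic font attributes. The `Block` type, its fields, and the `value_right_of` rule are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """A visual block as a segmentation step might report it (hypothetical)."""
    text: str
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels
    width: float
    height: float
    font_size: float = 12.0
    bold: bool = False


def vertical_overlap(a: Block, b: Block) -> float:
    """Length of the vertical extent shared by two blocks (0 if none)."""
    top = max(a.y, b.y)
    bottom = min(a.y + a.height, b.y + b.height)
    return max(0.0, bottom - top)


def value_right_of(label_text: str, blocks: List[Block]) -> Optional[Block]:
    """Find the nearest block to the right of a bold label on the same line.

    The rule uses only visual features (bold text) and relative positions,
    not the structure of the DOM tree.
    """
    labels = [
        b for b in blocks
        if b.bold and b.text.strip().rstrip(":") == label_text
    ]
    if not labels:
        return None
    label = labels[0]
    right_edge = label.x + label.width
    candidates = [
        b for b in blocks
        if b is not label
        and b.x >= right_edge
        and vertical_overlap(label, b) > 0
    ]
    # Pick the candidate with the smallest horizontal gap to the label.
    return min(candidates, key=lambda b: b.x - right_edge, default=None)
```

Under these assumptions, a rule such as "the value is the block immediately to the right of the bold label" keeps working when the underlying markup changes (for example from a table to nested `div` elements), as long as the rendered layout stays the same.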