Eighth International Conference on Document Analysis and Recognition (ICDAR'05)
Table Structure Analysis Based on Cell Classification and Cell Modification for XML Document Transformation
Seoul, Korea
August 31-September 1, 2005
ISBN: 0-7695-2420-6
A new method of table structure analysis based on cell classification and cell modification is proposed in this paper as the basis of an OCR system that can convert a variety of printed tables into XML documents conforming to a specified XML schema. The method proceeds as follows. First, cell features defined by ruled lines, which correspond to data fields, are extracted from the input image of a table. Each cell is then classified in order to identify irregular tables, whose ruled lines do not form a grid, and such cells are modified to form a regular cell arrangement. Next, the hierarchical table structure, consisting of a regular row structure of cells, is extracted from the modified regular table and described as a DOM tree. At this stage, logical objects within a cell are extracted and converted into a sub-tree of the DOM tree. Finally, the DOM tree is transformed into the target XML document by an XML parser with an information extraction process. Experimental results show that the method is effective in transforming a variety of printed tables into a variety of XML documents.
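The pipeline outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cell representation (a bounding box plus text), the function names `grid_lines`, `is_irregular`, `regularize`, and `grid_to_dom`, the element names `table`, `row`, and `cell`, and the rule of copying a spanning cell's text into every grid position it covers are all assumptions made for illustration. Image processing, logical object extraction, and schema-driven transformation are omitted.

```python
from xml.dom.minidom import Document

# Hypothetical cell representation: (x0, y0, x1, y1, text), where the
# coordinates are the positions of the ruled lines that bound the cell.

def grid_lines(cells):
    """Collect the sorted x and y positions of all ruled lines."""
    xs = sorted({c[0] for c in cells} | {c[2] for c in cells})
    ys = sorted({c[1] for c in cells} | {c[3] for c in cells})
    return xs, ys

def is_irregular(cell, xs, ys):
    """Classify a cell: irregular if it spans more than one grid interval."""
    x0, y0, x1, y1, _ = cell
    return (xs.index(x1) - xs.index(x0) > 1) or (ys.index(y1) - ys.index(y0) > 1)

def regularize(cells):
    """Modify cells into a regular grid (assumed rule: a spanning cell's
    text is copied into every grid position it covers)."""
    xs, ys = grid_lines(cells)
    grid = [[""] * (len(xs) - 1) for _ in range(len(ys) - 1)]
    for x0, y0, x1, y1, text in cells:
        for r in range(ys.index(y0), ys.index(y1)):
            for c in range(xs.index(x0), xs.index(x1)):
                grid[r][c] = text
    return grid

def grid_to_dom(grid):
    """Describe the regular row structure of cells as a DOM tree."""
    doc = Document()
    table = doc.createElement("table")
    doc.appendChild(table)
    for row in grid:
        row_el = doc.createElement("row")
        table.appendChild(row_el)
        for text in row:
            cell_el = doc.createElement("cell")
            cell_el.appendChild(doc.createTextNode(text))
            row_el.appendChild(cell_el)
    return doc
```

For example, a header cell spanning two columns above two ordinary cells would be classified as irregular, expanded into two grid positions, and then serialized row by row with `grid_to_dom(regularize(cells)).toxml()`.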
Citation:
Yasuto ISHITANI, Kosei FUME, Kazuo SUMITA, "Table Structure Analysis Based on Cell Classification and Cell Modification for XML Document Transformation," Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pp. 1247-1252, 2005.