Fourth International Conference on Document Analysis and Recognition (ICDAR'97)
An interactive system to extract structured text from a geometrical representation
Ulm, GERMANY
August 18-20, 1997
ISBN: 0-8186-7898-4
B. Poirier, Dept. de Genie Electr., Ecole Polytech. de Montreal, Que., Canada
M. Dagenais, Dept. de Genie Electr., Ecole Polytech. de Montreal, Que., Canada
The proliferation of electronic document formats impedes the dissemination and management of documents. Document indexing and navigation require a common format that carries structural information. While some formats make it easy to decode and preserve document structure, often the only easily obtainable representation is PostScript, in which only the geometrical information remains. Even if an organization is willing to convert all its document-producing activities to a structure-preserving format such as HTML, its existing documents still need to be converted. This paper addresses the difficult problem of extracting the structure of a document from a geometrical representation. An interactive tool has been developed that extracts document content and structure from a geometric representation (PostScript). It successfully analyzes several documents produced with different tools and outputs the structural information in the HyperText Markup Language (HTML). The end user, when presented with the extracted document structure, can interactively modify it if needed. The tool is easily extended to recognize new constructs and is aimed at organizations that need to convert numerous documents for searching and browsing on intranets or on the Internet.
Index Terms:
document image processing; interactive system; structured text extraction; geometrical representation; electronic document formats; common format; structural information; document indexing; document structure information; Postscript; geometrical information; structure preserving format; HTML; interactive tool; document content extraction; HyperText Markup Language; extracted document structure; intranets; Internet
Citation:
B. Poirier, M. Dagenais, "An interactive system to extract structured text from a geometrical representation," Fourth International Conference on Document Analysis and Recognition (ICDAR'97), pp. 342, 1997