First International Workshop on Document Image Analysis for Libraries (DIAL'04)
Xed: A New Tool for eXtracting Hidden Structures from Electronic Documents
Palo Alto, California
January 23-January 24
ISBN: 0-7695-2088-X
Karim Hadjar, University of Fribourg
Maurizio Rigamonti, University of Fribourg
Denis Lalanne, University of Fribourg
Rolf Ingold, University of Fribourg

PDF has become a very common format for exchanging printable documents. Moreover, it can easily be generated from the major document formats, which makes a huge number of PDF documents available over the net. However, its use is limited to displaying and printing, which considerably reduces search and retrieval capabilities. For this reason, additional tools have recently appeared that extract the textual content. Their practical use is limited, however, because the text's reading order is not necessarily preserved, especially when handling multi-column documents or in the presence of complex layouts. Our thesis is that these tools do not consider the hidden layout and logical structures of documents, which could greatly improve their results.

We propose a novel approach to document content extraction that merges a) low-level extraction methods applied to PDF files with b) layout analysis performed on a synthetically generated TIFF image. The paper describes the various steps necessary to achieve this task. Finally, we present a first experiment on restoring the reading order of newspapers, which shows encouraging results.
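
To make the reading-order problem concrete, the following is a minimal sketch in Python and not the Xed algorithm itself, which the abstract does not detail. It assumes that low-level extraction has already produced text blocks with bounding boxes, and that columns can be approximated by grouping blocks whose horizontal extents overlap. The `Block` type and both function names are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block: content plus bounding box (y grows downward)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_columns(blocks):
    """Group blocks whose horizontal extents overlap into the same column."""
    columns = []
    for b in sorted(blocks, key=lambda blk: blk.x0):
        for col in columns:
            # Overlap test on the x axis against the column's current extent.
            if b.x0 < col["x1"] and b.x1 > col["x0"]:
                col["blocks"].append(b)
                col["x0"] = min(col["x0"], b.x0)
                col["x1"] = max(col["x1"], b.x1)
                break
        else:
            # No existing column overlaps this block: start a new one.
            columns.append({"x0": b.x0, "x1": b.x1, "blocks": [b]})
    return columns


def reading_order(blocks):
    """Read columns left to right, and blocks top to bottom inside each column."""
    ordered = []
    for col in sorted(group_into_columns(blocks), key=lambda c: c["x0"]):
        ordered.extend(sorted(col["blocks"], key=lambda blk: blk.y0))
    return ordered


# Blocks listed in an arbitrary order, as they might appear in a PDF content stream.
stream_order = [
    Block("A2", 0, 100, 100, 180),
    Block("B1", 120, 0, 220, 80),
    Block("A1", 0, 0, 100, 80),
    Block("B2", 120, 100, 220, 180),
]
restored = [b.text for b in reading_order(stream_order)]
```

A plain top-to-bottom sort of the same blocks would interleave the two columns. Grouping by column first is what keeps each column's text contiguous, which is the kind of layout knowledge the abstract argues plain text extractors lack. Real newspaper pages, with headlines spanning several columns, would need a richer layout model than this sketch provides.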

Citation:
Karim Hadjar, Maurizio Rigamonti, Denis Lalanne, Rolf Ingold, "Xed: A New Tool for eXtracting Hidden Structures from Electronic Documents," dial, pp.212, First International Workshop on Document Image Analysis for Libraries (DIAL'04), 2004