Seventh International Conference on Document Analysis and Recognition (ICDAR'03) - Volume 2
Extraction, layout analysis and classification of diagrams in PDF documents
Edinburgh, Scotland
August 3-6, 2003
ISBN: 0-7695-1960-1
Robert P. Futrelle, Northeastern University
Mingyan Shao, Northeastern University
Chris Cieslik, Northeastern University
Andrea Elaina Grimes, Northeastern University
Diagrams are a critical part of virtually all scientific and technical documents. Analyzing diagrams will be important for building comprehensive document retrieval systems. This paper focuses on the extraction and classification of diagrams from PDF documents. We study diagrams available in vector (not raster) format in online research papers.
PDF files are parsed and their vector graphics components are stored in a spatial index. Subdiagrams are found by analyzing white space gaps. A set of statistics is generated for each diagram, e.g., the number of horizontal lines and the number of vertical lines. These statistics form a feature vector that describes the diagram. The vectors serve as input to a kernel-based machine learning system, a Support Vector Machine (SVM). In a test separating bar graphs from non-bar-graphs, with diagrams gathered from 20,000 biology research papers, the classification accuracy was 91.7%. The approach is directly applicable to diagrams vectorized from images.
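The gap analysis and the line-count statistics can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each graphics component is a straight segment given as `(x1, y1, x2, y2)`, and it replaces the spatial index with a simple sort along the x-axis. The names `split_on_gaps`, `line_features`, `min_gap`, and `tol` are hypothetical, and the three-element vector stands in for the paper's larger set of statistics.

```python
from typing import List, Tuple

# Assumed representation: a straight segment as (x1, y1, x2, y2).
Segment = Tuple[float, float, float, float]


def split_on_gaps(segments: List[Segment], min_gap: float) -> List[List[Segment]]:
    """Group segments into subdiagrams separated by white space along x.

    Segments are visited in order of their left edge. A new group starts
    whenever a segment begins more than `min_gap` past the right edge of
    the current group.
    """
    groups: List[List[Segment]] = []
    right_edge = float("-inf")
    for seg in sorted(segments, key=lambda s: min(s[0], s[2])):
        left = min(seg[0], seg[2])
        if not groups or left - right_edge > min_gap:
            groups.append([])
            right_edge = float("-inf")
        groups[-1].append(seg)
        right_edge = max(right_edge, seg[0], seg[2])
    return groups


def line_features(segments: List[Segment], tol: float = 0.5) -> List[int]:
    """Return [n_horizontal, n_vertical, n_other] for one subdiagram."""
    horizontal = vertical = other = 0
    for x1, y1, x2, y2 in segments:
        dx, dy = abs(x2 - x1), abs(y2 - y1)
        if dy <= tol and dx > tol:
            horizontal += 1
        elif dx <= tol and dy > tol:
            vertical += 1
        else:
            other += 1
    return [horizontal, vertical, other]


# Toy input: a small bar-graph-like cluster plus one slanted line far to the right.
diagram = [
    (0, 0, 10, 0),   # x-axis
    (0, 0, 0, 8),    # y-axis
    (2, 0, 2, 5),    # left edge of a bar
    (4, 0, 4, 5),    # right edge of a bar
    (2, 5, 4, 5),    # top of the bar
    (30, 0, 40, 6),  # slanted line in a separate region
]

subdiagrams = split_on_gaps(diagram, min_gap=5.0)
vectors = [line_features(group) for group in subdiagrams]
print(vectors)  # [[2, 3, 0], [0, 0, 1]]
```

Each resulting vector would then be passed, together with a bar-graph or non-bar-graph label, to an SVM for training or prediction. That step is omitted here because the abstract does not specify the kernel or the full feature set.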
Citation:
Robert P. Futrelle, Mingyan Shao, Chris Cieslik, Andrea Elaina Grimes, "Extraction, layout analysis and classification of diagrams in PDF documents," Seventh International Conference on Document Analysis and Recognition (ICDAR'03), vol. 2, p. 1007, 2003.