Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 1 Document Content Inventory and Retrieval Curitiba, Parana, Brazil September 23-September 26 ISBN: 0-7695-2822-8
We give an analysis of relationships between expected retrieval performance and classification recognition accuracy in the context of document image content extraction and inventory. By content extraction we mean location and measurement of regions containing handwriting, machine-printed text, photographs, blank space, etc., in documents represented as bilevel, grey-level, or color images. Recent experiments have shown that even modest per-pixel content classification accuracies can support usefully high recall and precision rates (e.g., 80–90%) for retrieval queries that seek, within a document collection, pages containing at least a given minimum fraction of a certain type of content. In an effort to elucidate this interesting empirical result, we have analyzed the interdependency of classification and retrieval under a variety of assumptions about the distribution of content types in document image collections. We show that under general conditions we cannot derive precision and recall measures from per-pixel classification measures alone, but we can estimate the expected values of these measures. If, however, the distribution of content and the error rates are uniform across the entire collection, our results suggest that it is possible to predict precision and recall measures from classification accuracy and vice versa.
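The relationship described above between per-pixel accuracy and page-level retrieval can be illustrated with a small simulation. This is only a sketch under assumptions that are not taken from the paper: each page's true content fraction is drawn uniformly at random, per-pixel errors are independent, and the classifier is summarized by two hypothetical parameters, `tpr` (probability that a content pixel is labeled correctly) and `fpr` (probability that a non-content pixel is mislabeled as content). The function name `simulate_retrieval` and all default values are likewise illustrative.

```python
import random


def simulate_retrieval(n_pages=500, pixels_per_page=1000,
                       tpr=0.8, fpr=0.1, threshold=0.3, seed=0):
    """Estimate (precision, recall) for the query
    'pages whose content fraction is at least `threshold`'.

    Assumptions (not from the paper): uniform true fractions,
    independent per-pixel errors, identical error rates on every page.
    """
    rng = random.Random(seed)
    true_pos = false_pos = false_neg = 0

    for _ in range(n_pages):
        # Ground truth: how many pixels on this page are the target content.
        n_content = round(rng.random() * pixels_per_page)
        n_other = pixels_per_page - n_content

        # Noisy per-pixel classification of both pixel populations.
        detected = sum(rng.random() < tpr for _ in range(n_content))
        detected += sum(rng.random() < fpr for _ in range(n_other))

        relevant = n_content / pixels_per_page >= threshold
        retrieved = detected / pixels_per_page >= threshold

        if retrieved and relevant:
            true_pos += 1
        elif retrieved:
            false_pos += 1
        elif relevant:
            false_neg += 1

    # Guard against empty denominators (no page retrieved / no page relevant).
    n_retrieved = true_pos + false_pos
    n_relevant = true_pos + false_neg
    precision = true_pos / n_retrieved if n_retrieved else 0.0
    recall = true_pos / n_relevant if n_relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    p, r = simulate_retrieval()
    print(f"precision={p:.3f} recall={r:.3f}")
```

Varying `tpr`, `fpr`, and `threshold` in this sketch gives a rough feel for how page-level precision and recall respond to per-pixel error rates when those rates are uniform across the collection; it does not reproduce the paper's analysis or its experimental figures.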
Citation:
H. Baird, M. Moll, "Document Content Inventory and Retrieval," icdar, vol. 1, pp. 93-97, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 1, 2007