2012 IEEE 8th International Conference on E-Science (2012)
Chicago, IL, USA
Oct. 8, 2012 to Oct. 12, 2012
ISBN: 978-1-4673-4467-8
pp: 1-6
Liana Diesendruck, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
Luigi Marini, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
Rob Kooper, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
Mayank Kejriwal, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
Kenton McHenry, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
ABSTRACT
Automated search of handwritten content is a topic of considerable interest and practical value, especially now that large digitized document collections are publicly available. We describe our efforts with the National Archives and Records Administration (NARA) to provide searchable access to the 1940 Census data and discuss the HPC resources needed to implement the proposed framework. Instead of trying to recognize the handwritten text, which remains a very difficult task, we use a content-based image retrieval technique known as Word Spotting. Under this paradigm, the system is queried with images of handwritten text rather than ASCII text, and ranked groups of similar-looking images are presented to the user. A significant amount of computing power is needed to pre-process the data so as to make this search capability available on an archive. We discuss the required pre-processing steps and the open source framework we developed, focusing on the HPC considerations that are relevant when preparing to provide searchable access to sizeable collections such as the US Census. Having processed the state of North Carolina from the 1930 Census using 98,000 SUs, we estimate that processing the entire country for 1940 could require up to 2.5 million SUs. The proposed framework can be used to provide an alternative to costly manual transcription for a variety of digitized paper archives.
INDEX TERMS
Vectors, Clustering algorithms, Feature extraction, Image segmentation, Memory management, Couplings, Indexes
CITATION

L. Diesendruck, L. Marini, R. Kooper, M. Kejriwal and K. McHenry, "Digitization and search: A non-traditional use of HPC," 2012 IEEE 8th International Conference on E-Science (ESCIENCE), Chicago, IL, USA, 2012, pp. 1-6.
doi:10.1109/eScience.2012.6404445