2011 International Conference on Document Analysis and Recognition
Facilitating Understanding of Large Document Collections
Beijing, China
September 18-September 21
ISBN: 978-0-7695-4520-2
Pages: 1334-1338
ISSN: 1520-5363
DOI: 10.1109/ICDAR.2011.268
Publisher: IEEE Computer Society, Los Alamitos, CA, USA
Large document collections that cover multiple topics can be overwhelming to understand, and developing access points for them demands significant time and effort from librarians and archivists. Efficient computational methods can aid this process by uncovering groups of documents that can be described for access. We investigate the use of density-based clustering combined with document segmentation to identify points of access as dense clusters of information. The method returns stories and classes of cohesive clusters that can be described as precise points of access. We found that our method performs more efficiently than K-means clustering and topic modeling with Latent Dirichlet Allocation (LDA). We use Hadoop to process a large document collection.
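To make the idea of density-based clustering over document segments concrete, the following is a minimal sketch only. The abstract does not name the clustering algorithm, the segmentation rule, or the distance measure, so everything here is an assumption: DBSCAN stands in for the density-based method, `segment` uses hypothetical fixed-size word windows, and segments are compared by cosine distance over raw term counts. It runs on a single machine and does not attempt to reproduce the Hadoop processing.

```python
import math
import re
from collections import Counter


def segment(text, size=8):
    """Split text into fixed-size word windows (assumed segmentation rule)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps, min_pts):
    """Plain DBSCAN (assumed stand-in): one label per vector, -1 for noise."""
    n = len(vectors)
    # Precompute eps-neighborhoods; each point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels


if __name__ == "__main__":
    # Toy documents, invented for illustration only.
    docs = [
        "the city budget raises the property tax and the sales tax this year",
        "fishing on the river requires a permit and the river is closed in spring",
    ]
    vectors = [Counter(seg) for doc in docs for seg in segment(doc)]
    print(dbscan(vectors, eps=0.5, min_pts=2))
```

In this sketch, segments that end up in the same dense cluster would be the candidates for a shared access point, while segments labeled -1 are treated as noise. The values of `eps`, `min_pts`, and the window size are illustrative and would need tuning on a real collection.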
Index Terms:
density based clustering, information retrieval, distributed processing, Hadoop/MapReduce, digital archives
Citation:
Jae Hyeon Bae, Weijia Xu, Maria Esteva, "Facilitating Understanding of Large Document Collections," 2011 International Conference on Document Analysis and Recognition (ICDAR), pp. 1334-1338, 2011.
