String Processing and Information Retrieval, International Symposium on (1999)
Cancun, Mexico
Sept. 21, 1999 to Sept. 24, 1999
ISBN: 0-7695-0268-7
pp: 73
Alexander Gelbukh, National Polytechnic Institute, Mexico City, Mexico
Grigori Sidorov, National Polytechnic Institute, Mexico City, Mexico
Adolfo Guzmán-Arenas, National Polytechnic Institute, Mexico City, Mexico
ABSTRACT
Given a large hierarchical dictionary of concepts, we consider the task of selecting the concepts that describe the contents of a given document. The main difficulty lies in the proper handling of the top-level concepts in the hierarchy. The document is represented as a histogram of topics, each weighted by its contribution to the document. The contribution is determined by comparing the document with the "ideal" document for each topic in the dictionary. The "ideal" document for a concept is one that contains only the keywords belonging to this concept, in proportion to their occurrences in the training corpus. A fast comparison algorithm for some types of metrics is proposed. The application of the method in the Classifier system is discussed.
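The idea of comparing a document with per-topic "ideal" documents can be illustrated with a minimal sketch. The abstract does not specify the metric, the data structures, or how the hierarchy is handled, so everything below is an assumption made for illustration: cosine similarity stands in for the comparison metric, the topic dictionary is flat, and the names `ideal_document`, `cosine`, and `topic_histogram` are hypothetical rather than taken from the paper or the Classifier system.

```python
from collections import Counter
from math import sqrt


def ideal_document(topic_keywords, corpus_counts):
    """Keyword weights for one topic, proportional to training-corpus counts.

    Assumption: the "ideal" document is modelled as a normalized frequency
    vector over the topic's keywords only.
    """
    counts = {w: corpus_counts.get(w, 0) for w in topic_keywords}
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items() if c > 0}


def cosine(a, b):
    """Cosine similarity of two sparse vectors (an assumed metric)."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_histogram(document_words, topics, corpus_counts):
    """Score every topic by comparing the document with its ideal document."""
    doc = Counter(document_words)
    return {
        name: cosine(doc, ideal_document(keywords, corpus_counts))
        for name, keywords in topics.items()
    }


# Toy data, invented purely for illustration.
topics = {"animals": ["cat", "dog"], "vehicles": ["car", "bus"]}
corpus_counts = {"cat": 3, "dog": 1, "car": 2, "bus": 2}
words = "the cat saw a dog and a cat".split()
histogram = topic_histogram(words, topics, corpus_counts)
```

In this toy run the document shares keywords only with the "animals" topic, so that topic receives a positive score and "vehicles" scores zero. The sketch recomputes each comparison from scratch and ignores the concept hierarchy; it does not attempt to reproduce the fast comparison algorithm or the treatment of top-level concepts that the abstract mentions.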
INDEX TERMS
Document classification, topic detection, document comparison metrics, natural language processing, information retrieval
CITATION
Alexander Gelbukh, Grigori Sidorov, Adolfo Guzmán-Arenas, "A Method of Describing Document Contents through Topic Selection", String Processing and Information Retrieval, International Symposium on, vol. 00, pp. 73, 1999, doi:10.1109/SPIRE.1999.796580