Issue No.08 - August (2004 vol.16)
pp: 949-964
ABSTRACT
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, followed by clustering with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground truth news corpus; it shows this technique is effective in identifying topics in collections of news articles.
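As a rough illustration of the first stage described in the abstract, the sketch below treats each article as a set of extracted entities and counts which entity combinations co-occur in at least a minimum number of articles. It is not the paper's implementation: the function name `frequent_itemsets`, the parameters `min_support` and `max_size`, and the toy entity names are all assumptions made for illustration, and the counting is plain brute-force enumeration rather than an optimized miner such as Apriori. The hypergraph partitioning stage is not shown.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Return {itemset: article_count} for itemsets found in >= min_support articles.

    articles: iterable of sets of entity strings (one set per article).
    max_size: largest itemset size to enumerate (hypothetical cap to keep
              the brute-force enumeration small).
    """
    counts = Counter()
    for items in articles:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Toy corpus: each article reduced to its (made-up) key entities.
articles = [
    {"PersonA", "PersonB", "OrgX"},
    {"PersonA", "OrgX", "CityY"},
    {"PersonA", "PersonB", "OrgX", "PersonC"},
    {"OrgZ", "RegionW", "PersonD"},
    {"OrgZ", "RegionW"},
]

result = frequent_itemsets(articles, min_support=2)
for itemset, n in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), n)
```

In this toy corpus the entities around `PersonA` and `OrgX` form one group of frequent itemsets and those around `OrgZ` and `RegionW` form another; a clustering step over such itemsets would be what merges overlapping ones into topics.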
INDEX TERMS
Topic detection, data mining, clustering.
CITATION
Chris Clifton, Robert Cooley, Jason Rennie, "TopCat: Data Mining for Topic Identification in a Text Corpus", IEEE Transactions on Knowledge & Data Engineering, vol. 16, no. 8, pp. 949-964, August 2004, doi:10.1109/TKDE.2004.32