TopCat: Data Mining for Topic Identification in a Text Corpus
August 2004 (vol. 16 no. 8)
pp. 949-964

Abstract: TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This lets us view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items and are then clustered with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground-truth news corpus; it shows that this technique is effective in identifying topics in collections of news articles.
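
As a rough illustration of the first stage described in the abstract, the sketch below treats each article as a set of entity items and counts which item combinations recur across articles. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `frequent_itemsets`, the parameters `min_support` and `max_size`, and the toy entity names are all hypothetical, the counting is brute-force enumeration rather than an optimized association-rule miner, and the hypergraph partitioning step is not shown.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Count itemsets of size 2..max_size and keep those that occur
    in at least `min_support` articles.

    Brute-force enumeration for illustration only; it is exponential
    in article size and is not the algorithm used in the paper.
    """
    counts = Counter()
    for items in articles:
        unique_items = sorted(set(items))  # one count per article
        for size in range(2, max_size + 1):
            for combo in combinations(unique_items, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Toy corpus (made-up entities): each article is reduced to the set of
# key entities that a named-entity extractor might have produced.
articles = [
    {"Acme", "Globex", "J. Doe"},
    {"Acme", "J. Doe", "Springfield"},
    {"Acme", "Globex", "J. Doe", "R. Roe"},
    {"Riverton", "Mayor Poe", "Flood Agency"},
    {"Riverton", "Flood Agency", "Relief Fund"},
]

topics = frequent_itemsets(articles, min_support=2)
for itemset, n in sorted(topics.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), n)
```

In this toy corpus the recurring combinations fall into two groups, one around the Acme entities and one around Riverton, which is the kind of signal a later clustering step could merge into topics.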


Index Terms:
Topic detection, data mining, clustering.
Citation:
Chris Clifton, Robert Cooley, Jason Rennie, "TopCat: Data Mining for Topic Identification in a Text Corpus," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 8, pp. 949-964, Aug. 2004, doi:10.1109/TKDE.2004.32