ACIRD: Intelligent Internet Document Organization and Retrieval
May/June 2002 (vol. 14 no. 3)
pp. 599-614

Abstract: This paper presents an intelligent Internet information system, Automatic Classifier for the Internet Resource Discovery (ACIRD), which uses machine learning techniques to organize and retrieve Internet documents. ACIRD consists of a knowledge acquisition process, a document classifier, and a two-phase search engine. The knowledge acquisition process automatically learns classification knowledge from classified Internet documents. The document classifier applies the learned classification knowledge to assign newly collected Internet documents to one or more classes. Experimental results indicate that ACIRD performs as well as or better than human experts in both knowledge acquisition and document classification. Using the learned classification knowledge and the given class lattice, the ACIRD two-phase search engine responds to user queries with hierarchically structured, navigable results (instead of a conventional flat ranked document list), which helps users locate information among numerous, diverse Internet documents.

Index Terms:
Document classification, data mining, information retrieval, search engine
Citation:
S.-H. Lin, M.C. Chen, J.-M. Ho, Y.-M. Huang, "ACIRD: Intelligent Internet Document Organization and Retrieval," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 3, pp. 599-614, May-June 2002, doi:10.1109/TKDE.2002.1000345