Document Processing for Automatic Knowledge Acquisition
February 1994 (vol. 6 no. 1)
pp. 3-21

The knowledge acquisition bottleneck has become the major impediment to the development and application of effective information systems. To remove this bottleneck, new document processing techniques must be introduced to acquire knowledge automatically from various types of documents. By surveying the techniques and problems involved, this paper aims to serve as a catalyst for research in automatic knowledge acquisition through document processing. In this study, a document is considered to have two structures: a geometric structure and a logical structure. These play a key role in knowledge acquisition, which can be viewed as the process of acquiring these two structures. Extracting the geometric structure from a document is referred to as document analysis; mapping the geometric structure into the logical structure is regarded as document understanding. Both areas are described in this paper, and the basic concept of document structure and its measurement based on entropy analysis are introduced. Logical structure and geometric models are proposed. Both top-down and bottom-up approaches and their entropy analyses are presented. Different techniques are discussed with practical examples. Mapping methods, such as tree transformation, document formatting knowledge, and document format description language, are described.
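
To make the two structures concrete, the following is a minimal sketch and not the paper's method. It assumes a toy geometric structure (a tree of rectangular blocks), a hypothetical rule-based mapping from geometric blocks to logical labels, and plain Shannon entropy over the label distribution as a stand-in for the entropy analysis the abstract mentions. The names `Block`, `leaves`, `to_logical`, and `shannon_entropy`, the block kinds, and the 15% title threshold are all illustrative assumptions.

```python
from __future__ import annotations

import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Block:
    """A node of a toy geometric structure: a rectangular page region."""
    kind: str          # assumed kinds: "page", "text", "graphic"
    x: int
    y: int
    width: int
    height: int
    children: list[Block] = field(default_factory=list)


def leaves(block):
    """Yield the leaf blocks of the geometric tree in depth-first order."""
    if not block.children:
        yield block
    else:
        for child in block.children:
            yield from leaves(child)


def to_logical(block, page_height):
    """Hypothetical mapping rule from a geometric block to a logical label."""
    if block.kind == "graphic":
        return "figure"
    if block.y < 0.15 * page_height:   # assumed: text near the top is a title
        return "title"
    return "paragraph"


def shannon_entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A toy page: one heading-like text block, two body text blocks, one graphic.
page = Block("page", 0, 0, 800, 1000, children=[
    Block("text", 50, 40, 700, 60),
    Block("text", 50, 150, 700, 300),
    Block("graphic", 50, 480, 700, 250),
    Block("text", 50, 760, 700, 200),
])

geometric_kinds = [b.kind for b in leaves(page)]
logical_labels = [to_logical(b, page.height) for b in leaves(page)]

print(logical_labels)
print(shannon_entropy(geometric_kinds), shannon_entropy(logical_labels))
```

In this sketch, document analysis corresponds to producing the `Block` tree and document understanding to the `to_logical` mapping; a real system would replace the single positional rule with formatting knowledge or a format description language.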


Index Terms:
knowledge acquisition; document handling; visual databases; deductive databases; document processing; automatic knowledge acquisition; knowledge acquisition bottleneck; information systems; geometric structure; logical structure; document analysis; document understanding; entropy analysis; bottom-up approaches; top-down approaches; geometric models; mapping methods; tree transformation; document formatting knowledge; document format description language
Citation:
Y.Y. Tang, C.D. Yan, C.Y. Suen, "Document Processing for Automatic Knowledge Acquisition," IEEE Transactions on Knowledge and Data Engineering, vol. 6, no. 1, pp. 3-21, Feb. 1994, doi:10.1109/69.273022