Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 1
Learning of Pattern-Based Rules for Document Classification
Curitiba, Parana, Brazil
September 23-September 26
ISBN: 0-7695-2822-8
Automatic processing of office documents, such as orders, invoices, or offers, holds significant potential for cost savings. Because such domains have a high percentage of special vocabulary, purely statistical approaches fail in automatic classification. The inherent structure and the short text messages require specific approaches. We propose a rule-based method to classify mixed stacks of documents into a set of hierarchically organized classes. Rules are learned by extracting patterns of different types from a document sample. The paper focuses on the architecture and on the learning process, presents results comparing the method with other techniques, and gives an outlook on how to further improve the system.
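To make the idea of learning rules from a document sample concrete, the following is a minimal sketch and not the system described in the paper. It assumes a single, very simple pattern type (lowercased word tokens), a flat set of classes rather than a hierarchy, and a hypothetical `min_support` threshold; the function names `tokens`, `learn_rules`, and `classify` are illustrative only.

```python
import re
from collections import defaultdict

def tokens(text):
    """Lowercased alphanumeric tokens as a set (one assumed pattern type)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def learn_rules(sample, min_support=0.5):
    """Learn one token rule set per class from (text, label) pairs.

    A token becomes a rule pattern for a class if it occurs in at least
    min_support of that class's documents and in no document of any
    other class. This criterion is an assumption made for illustration.
    """
    docs_by_class = defaultdict(list)
    for text, label in sample:
        docs_by_class[label].append(tokens(text))

    rules = {}
    for label, docs in docs_by_class.items():
        # Collect every token seen in documents of the other classes.
        other = set()
        for other_label, other_docs in docs_by_class.items():
            if other_label != label:
                other.update(*other_docs)
        # Count in how many documents of this class each token occurs.
        counts = defaultdict(int)
        for doc in docs:
            for tok in doc:
                counts[tok] += 1
        rules[label] = {
            tok for tok, c in counts.items()
            if c / len(docs) >= min_support and tok not in other
        }
    return rules

def classify(text, rules):
    """Return the class with the most matching patterns, or None if none match."""
    toks = tokens(text)
    scores = {label: len(toks & pats) for label, pats in rules.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None
```

In this sketch a document that matches no learned pattern is left unclassified rather than forced into a class, which is one plausible way to handle a mixed stack containing unknown document types.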