Proceedings of the 41st Annual Hawaii International Conference on System Sciences (HICSS 2008)
Waikoloa, Big Island, Hawaii
Jan. 7, 2008 to Jan. 10, 2008
The field of automatic genre classification has primarily focused on extracting textual features from documents. The goal of this research is to investigate whether visual features of HTML web pages can improve the classification of fine-grained genres. Intuitively, visual features should help; the challenge is to extract those that capture the layout characteristics of the genres. A corpus of web pages from different e-commerce sites was generated and manually classified into several genres. Three different sets of features were compared: one with just textual features, one with HTML-level features added, and a third with visual features added. Our experiments confirm that using HTML features, and URL address features in particular, can improve classification beyond using textual features alone. We also show that adding visual features can further improve fine-grained genre classification.
R. Levering, M. Cutler and L. Yu, "Using Visual Features for Fine-Grained Genre Classification of Web Pages," Proceedings of the 41st Annual Hawaii International Conference on System Sciences (HICSS 2008), Waikoloa, Big Island, Hawaii, 2008, pp. 131.