2011 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (2011)
Beijing, China
Oct. 10, 2011 to Oct. 12, 2011
ISBN: 978-0-7695-4557-8
pp: 32-39
Mining repeated patterns from HTML documents is a key step towards Web-based data mining and knowledge extraction. Many web crawling applications need efficient repeated-pattern mining techniques to generate their wrappers automatically. Existing approaches such as tree matching and string matching can detect repeated patterns with high precision, but their running time remains a challenge for practical web crawling applications. In this paper, we propose an efficient approach for mining repeated patterns based on the indent shape of an HTML document. The indent shape is a novel and simple model of an HTML document, in which tandem repeated waves are strongly associated with the repeated patterns to be detected. By scanning an indent shape with a horizontal indent-line from bottom to top, the tandem repeated waves are identified by filtering out the wave segments with low self-similarity. The boundaries of the HTML code corresponding to the repeated patterns can then be identified and easily transformed into formally defined regular expressions. Extensive experiments on two practical data sets retrieved from the Internet show that our approach achieves significantly higher efficiency, and that its precision is also generally better than that of existing approaches.
Repeated Patterns Mining, Web Crawling, Indent Shape, Tandem Repeated Waves
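
The abstract does not give the algorithm in detail, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that the indent shape can be approximated by the nesting depth of each start tag, that a "wave" at a given level is a tag at that depth together with its deeper descendants, and that self-similarity can be stood in for by `difflib.SequenceMatcher.ratio()`. The level-by-level scan runs from the deepest level to the shallowest, which is one possible reading of "bottom to top". All names (`indent_shape`, `waves_at_level`, `tandem_repeats`) and the `threshold` and `min_repeats` parameters are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not deepen the shape.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _DepthRecorder(HTMLParser):
    """Records (depth, tag) for every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.shape = []

    def handle_starttag(self, tag, attrs):
        self.shape.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def indent_shape(html):
    """Assumed model: the indent shape is the list of (nesting depth, tag)."""
    recorder = _DepthRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.shape


def waves_at_level(shape, level):
    """Group adjacent waves based at `level`; a wave is a (start, end) slice."""
    groups, group, start = [], [], None
    for i, (depth, _tag) in enumerate(shape):
        if start is not None and depth > level:
            continue  # still inside the current wave
        if start is not None:
            group.append((start, i))  # the current wave ends here
            start = None
        if depth == level:
            start = i  # a new wave begins at this tag
        elif group:
            groups.append(group)  # the run of adjacent waves is broken
            group = []
    if start is not None:
        group.append((start, len(shape)))
    if group:
        groups.append(group)
    return groups


def tandem_repeats(shape, min_repeats=2, threshold=0.8):
    """Return (level, start, end, count) for runs of similar adjacent waves."""
    def signature(wave, level):
        return [(d - level, t) for d, t in shape[wave[0]:wave[1]]]

    results = []
    max_depth = max((d for d, _ in shape), default=0)
    for level in range(max_depth, -1, -1):  # deepest level first
        for group in waves_at_level(shape, level):
            run = [group[0]]
            for wave in group[1:] + [None]:  # None flushes the last run
                if wave is not None and SequenceMatcher(
                        None, signature(run[-1], level),
                        signature(wave, level)).ratio() >= threshold:
                    run.append(wave)
                    continue
                if len(run) >= min_repeats:
                    results.append((level, run[0][0], run[-1][1], len(run)))
                run = [wave]
    return results
```

Calling `tandem_repeats(indent_shape(page))` on a page that contains a list of identically structured items should report one run covering those items. The `start` and `end` values index into the shape rather than the raw HTML, so mapping them back to code boundaries or to regular expressions, as the abstract describes, is left out of this sketch.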
