2012 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (2012)
Sanya, China
Oct. 10, 2012 to Oct. 12, 2012
Open Source Forge (OSF) websites provide information on a massive number of open source software projects, and extracting these web data is important for open source research. Traditional extraction methods use string matching among pages to detect the page template, which is time-consuming. A recent work published in VLDB exploits redundant entities among websites to detect the web page coordinates of these entities. Its experiments give good results when these coordinates are used to extract other entities from the target site. However, OSF websites have few redundant project entities. This paper proposes a modified version of that redundancy-based method tailored for OSF websites, which relies on a similar yet weaker presumption: that entity attributes, rather than whole entities, are redundant. Like the previous work, we construct a seed database to detect the web page coordinates of the redundancies, but entirely at the attribute level. In addition, we apply attribute name verification to reduce false positives during extraction. The experimental results indicate that our approach is capable of extracting OSF websites, a scenario in which the previous method cannot be applied.
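The attribute-level idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the seed values, the page representation (a mapping from a node path to a label and value pair), and the function names `learn_coordinates` and `extract` are all assumptions made for illustration. The sketch votes for the page coordinate where known attribute values occur, then extracts from an unseen page only when the on-page label names the expected attribute.

```python
from collections import Counter

# Hypothetical seed database: attribute name -> known values (assumed data).
SEEDS = {
    "license": {"GPL", "MIT", "Apache-2.0"},
    "language": {"Java", "Python", "C++"},
}

def learn_coordinates(sample_pages, seeds):
    """Vote for the coordinate (node path) where each attribute's seed values appear."""
    votes = {attr: Counter() for attr in seeds}
    for page in sample_pages:
        for path, (label, value) in page.items():
            for attr, known in seeds.items():
                if value in known:
                    votes[attr][path] += 1
    # Keep the most frequently matched path per attribute.
    return {attr: c.most_common(1)[0][0] for attr, c in votes.items() if c}

def extract(page, coords):
    """Read each attribute at its learned coordinate, keeping the value only if
    the on-page label names that attribute (a simple attribute name check)."""
    record = {}
    for attr, path in coords.items():
        if path not in page:
            continue
        label, value = page[path]
        if attr in label.lower():
            record[attr] = value
    return record

# Assumed toy pages: node path -> (label text, value text).
samples = [
    {"/div[1]/span[2]": ("License:", "GPL"), "/div[2]/span[2]": ("Language:", "Java")},
    {"/div[1]/span[2]": ("License:", "MIT"), "/div[2]/span[2]": ("Language:", "Python")},
]
coords = learn_coordinates(samples, SEEDS)

# The target page holds a license absent from the seeds and a mislabeled second field.
target = {"/div[1]/span[2]": ("License:", "BSD"), "/div[2]/span[2]": ("Topic:", "Java")}
result = extract(target, coords)
print(result)
```

In this toy run the unseen license value is still extracted because its coordinate was learned from the seeds, while the value under the "Topic:" label is rejected by the name check even though it sits at the learned coordinate for the language attribute.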
Redundancy, HTML, Databases, Licenses, Data mining, Measurement, open source software, web extraction, redundancy, attribute, verification
Xiang Li, Yanxu Zhu, Gang Yin, Tao Wang, Huaimin Wang, "Exploiting Attribute Redundancy in Extracting Open Source Forge Websites", 2012 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, pp. 13-20, 2012, doi:10.1109/CyberC.2012.12