Proceedings. 20th International Conference on Data Engineering (2004)
Mar. 30, 2004 to Apr. 2, 2004
Neeraj Agrawal, IBM India Research Lab
Rema Ananthanarayanan, IBM India Research Lab
Rahul Gupta, IBM India Research Lab
Sachindra Joshi, IBM India Research Lab
Raghu Krishnapuram, IBM India Research Lab
Sumit Negi, IBM India Research Lab
Data presented on commerce sites runs into thousands of pages and is typically delivered from multiple back-end sources. This makes it difficult to identify incorrect, anomalous, or interesting data such as $9.99 air fares, missing links, drastic changes in prices, and the addition of new products or promotions. In this paper, we describe a system that monitors Web sites automatically and generates various types of reports so that the content of a site can be monitored and its quality maintained. Our solution consists of a site crawler that crawls dynamic pages, an information miner that learns to extract useful information from the pages based on examples provided by the user, and a reporter that the user can configure to answer specific queries. The tool can also be used to identify price trends and new products or promotions at competitor sites. A pilot run of the tool has been successfully completed at the ibm.com site.
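As an illustration of the kind of report described in the abstract, the following is a minimal sketch of a check that compares two snapshots of extracted prices and flags new products, removed products, and drastic price changes. It is not the EShopMonitor implementation: the function name `compare_snapshots`, the `PriceReport` type, the product keys, and the 50% threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceReport:
    """Hypothetical report structure; not taken from the paper."""
    new_products: list
    removed_products: list
    drastic_changes: list  # tuples of (product, old_price, new_price)


def compare_snapshots(old, new, max_relative_change=0.5):
    """Compare two {product: price} snapshots of a site.

    Flags products that appeared, products that disappeared, and
    prices whose relative change exceeds max_relative_change
    (an assumed, user-configurable threshold).
    """
    new_products = sorted(set(new) - set(old))
    removed_products = sorted(set(old) - set(new))
    drastic_changes = []
    for product in sorted(set(old) & set(new)):
        old_price, new_price = old[product], new[product]
        # Skip zero or negative old prices to avoid division by zero.
        if old_price > 0:
            change = abs(new_price - old_price) / old_price
            if change > max_relative_change:
                drastic_changes.append((product, old_price, new_price))
    return PriceReport(new_products, removed_products, drastic_changes)


# Made-up data: a fare that drops to $9.99 is flagged as anomalous.
yesterday = {"fare-BOS-NYC": 199.0, "laptop-a": 1200.0, "mouse": 20.0}
today = {"fare-BOS-NYC": 9.99, "laptop-a": 1150.0, "keyboard": 45.0}
report = compare_snapshots(yesterday, today)
```

In this sketch the snapshots stand in for the output of the crawler and information miner, and the threshold stands in for a query that a user might configure in the reporter.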
N. Agrawal, S. Joshi, R. Ananthanarayanan, S. Negi, R. Krishnapuram and R. Gupta, "EShopMonitor: A Web Content Monitoring Tool," Proceedings. 20th International Conference on Data Engineering (ICDE), Boston, Massachusetts, 2004, p. 817.