Issue No.12 - Dec. (2012 vol.24)
pp: 2203-2217
Yongwook Shin, Seoul National University, Seoul
Junseok Lim, Seoul National University, Seoul
Jonghun Park, Seoul National University, Seoul
Real-time search engines increasingly index web content from data streams, since many web sources, including news and social media sites, now deliver up-to-date information via streams. A crucial challenge for such engines is therefore improving index freshness, which depends primarily on the latencies incurred during fetching and indexing. Retrieval latency is the time lag between a document's publication and its fetching, while indexing latency is the delay before a fetched document is indexed, which is caused by finite indexing capacity. Retrieval latency can be satisfactorily addressed through appropriate fetch scheduling or recent real-time content notification protocols. However, as the total volume of real-time content grows rapidly, indexing latency becomes a challenging problem. Furthermore, the need to maximize index coverage makes it more difficult to reduce indexing latency under limited indexing capacity. We consider the problem of jointly optimizing indexing latency and index coverage, in which their relative importance can be adjusted, and propose an optimization model based on inventory control theory. Extensive experiments have been conducted to validate the proposed model, and the results suggest that the proposed approach outperforms the alternatives considered.
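To make the two latency definitions concrete, the following is a minimal sketch and not the optimization model proposed in the paper. It assumes a single indexer that processes documents in fetch order (FIFO) with a fixed per-document service time; the names `Doc`, `index_fifo`, and `service_time` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    published: float  # time the source published the document
    fetched: float    # time the search engine fetched it from the stream


def index_fifo(docs, service_time):
    """Index documents one at a time in fetch order (illustrative assumption).

    Each document occupies the indexer for `service_time`, which stands in
    for finite indexing capacity. Returns one latency record per document.
    """
    results = []
    indexer_free_at = 0.0
    for doc in sorted(docs, key=lambda d: d.fetched):
        # A document waits until it has been fetched and the indexer is idle.
        start = max(doc.fetched, indexer_free_at)
        indexed = start + service_time
        indexer_free_at = indexed
        results.append({
            # Lag between publication and fetching.
            "retrieval_latency": doc.fetched - doc.published,
            # Delay between fetching and the document being indexed.
            "indexing_latency": indexed - doc.fetched,
        })
    return results


# Three documents published at different times but fetched in the same poll.
docs = [Doc(0.0, 1.0), Doc(0.5, 1.0), Doc(0.75, 1.0)]
latencies = index_fifo(docs, service_time=2.0)
for record in latencies:
    print(record)
```

In this toy run the retrieval latencies stay small while the indexing latency grows with each queued document, which is the effect of limited indexing capacity that the abstract describes.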
Indexing, Erbium, Real time systems, Search engines, Delay, Inventory control, information retrieval, Feed, index freshness, index coverage, real-time search, search engine
Yongwook Shin, Junseok Lim, Jonghun Park, "Joint Optimization of Index Freshness and Coverage in Real-Time Search Engines", IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 12, pp. 2203-2217, Dec. 2012, doi:10.1109/TKDE.2011.144