Issue No. 12 - Dec. 2012 (vol. 24)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.144
Yongwook Shin, Seoul National University, Seoul
Junseok Lim, Seoul National University, Seoul
Jonghun Park, Seoul National University, Seoul
Real-time search engines increasingly index web content from data streams, since many web sources, including news and social media sites, now deliver up-to-date information via streams. Accordingly, a crucial challenge for a real-time search engine that uses data streams is to improve index freshness, which depends primarily on the latencies incurred during fetching and indexing. Retrieval latency is the time lag between a document's publication and its fetching, while indexing latency is the delay before a fetched document is indexed, which is caused by finite indexing capacity. The problem of retrieval latency can be satisfactorily addressed with appropriate fetch scheduling or recent real-time content notification protocols. However, as the total volume of real-time content grows rapidly, indexing latency becomes a challenging problem. Furthermore, the need to maximize index coverage makes it more difficult to reduce indexing latency under limited indexing capacity. We consider the problem of jointly optimizing indexing latency and index coverage, in which their relative importance can be adjusted, and propose an optimization model based on inventory control theory. Extensive experiments were conducted to validate the proposed model, and the results suggest that the proposed approach outperforms the alternatives.
Indexing, Erbium, Real time systems, Search engines, Delay, Inventory control, information retrieval, Feed, index freshness, index coverage, real-time search, search engine
J. Park, Y. Shin and J. Lim, "Joint Optimization of Index Freshness and Coverage in Real-Time Search Engines," in IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 12, pp. 2203-2217, 2012.