Issue No. 10 - October (2011 vol. 23)
pp: 1469-1482
Kyriakos Mouratidis, Singapore Management University, Singapore
HweeHwa Pang, Singapore Management University, Singapore
ABSTRACT
Consider a text filtering server that monitors a stream of incoming documents for a set of users, who register their interests in the form of continuous text search queries. The task of the server is to constantly maintain for each query a ranked result list, comprising the recent documents (drawn from a sliding window) with the highest similarity to the query. Such a system underlies many text monitoring applications that need to cope with heavy document traffic, such as news and email monitoring. In this paper, we propose the first solution for processing continuous text queries efficiently. Our objective is to support a large number of user queries while sustaining high document arrival rates. Our solution indexes the streamed documents in main memory with a structure based on the principles of the inverted file, and processes document arrival and expiration events with an incremental threshold-based method. We distinguish between two versions of the monitoring algorithm, an eager and a lazy one, which differ in how aggressively they manage the thresholds on the inverted index. Using benchmark queries over a stream of real documents, we experimentally verify the efficiency of our methodology; both its versions are at least an order of magnitude faster than a competitor constructed from existing techniques, with lazy being the best approach overall.
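To make the problem setting concrete, the following is a minimal sketch of a naive monitor, not the incremental threshold-based method proposed in the paper. It keeps the documents of a count-based sliding window in an in-memory inverted index and recomputes a query's top-k list from scratch on request. The class name NaiveMonitor, the count-based window, and the normalized term-frequency cosine scoring are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter, deque


def tf_vector(text):
    """Term-frequency vector scaled to unit Euclidean length (assumed weighting)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


class NaiveMonitor:
    """Hypothetical baseline: re-scores a query against the whole window."""

    def __init__(self, window_size, k):
        self.window_size = window_size  # number of most recent documents kept
        self.k = k                      # result list length per query
        self.window = deque()           # document ids in arrival order
        self.docs = {}                  # document id -> term weights
        self.index = {}                 # term -> {document id: weight}
        self.queries = {}               # query id -> term weights

    def register(self, qid, text):
        self.queries[qid] = tf_vector(text)

    def arrive(self, doc_id, text):
        """Index a new document and expire the oldest one if the window is full."""
        vec = tf_vector(text)
        self.docs[doc_id] = vec
        self.window.append(doc_id)
        for term, weight in vec.items():
            self.index.setdefault(term, {})[doc_id] = weight
        if len(self.window) > self.window_size:
            self._expire(self.window.popleft())

    def _expire(self, doc_id):
        """Remove an expired document from every inverted list it appears in."""
        for term in self.docs.pop(doc_id):
            postings = self.index[term]
            del postings[doc_id]
            if not postings:
                del self.index[term]

    def top_k(self, qid):
        """Score only documents sharing a term with the query, then rank them."""
        scores = Counter()
        for term, q_weight in self.queries[qid].items():
            for doc_id, d_weight in self.index.get(term, {}).items():
                scores[doc_id] += q_weight * d_weight
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[: self.k]


monitor = NaiveMonitor(window_size=2, k=1)
monitor.register("q", "stock market")
monitor.arrive(1, "stock market news")
monitor.arrive(2, "weather report")
first_result = monitor.top_k("q")
monitor.arrive(3, "market update")  # window is full, so document 1 expires
second_result = monitor.top_k("q")
```

Recomputing every result list in this way costs time proportional to the inverted lists of all query terms on each request, which is the kind of overhead an incremental approach aims to avoid when many queries and high arrival rates are involved.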
INDEX TERMS
Continuous queries, document streams, text filtering.
CITATION
Kyriakos Mouratidis, HweeHwa Pang, "Efficient Evaluation of Continuous Text Search Queries," IEEE Transactions on Knowledge & Data Engineering, vol. 23, no. 10, pp. 1469-1482, October 2011, doi:10.1109/TKDE.2011.125