A Statistical Method for Estimating the Usefulness of Text Databases
November/December 2002 (vol. 14 no. 6)
pp. 1422-1437

Abstract: Searching for desired data is one of the most common uses of the Internet. No single search engine is capable of searching all data on the Internet. An approach that provides a single interface for invoking multiple search engines for each user query has the potential to satisfy more users. When the number of search engines under the interface is large, invoking all of them for each query is often not cost-effective: sending the query to a large number of useless search engines creates unnecessary network traffic, and searching those engines wastes local resources. The problem can be overcome if the usefulness of every search engine with respect to each query can be predicted. In this paper, we present a statistical method to estimate the usefulness of a search engine for any given query. For a given query, the usefulness of a search engine is defined in this paper as a combination of the number of documents in the search engine that are sufficiently similar to the query and the average similarity of these documents. Experimental results indicate that our estimation method is much more accurate than existing methods.
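
The usefulness measure defined in the abstract can be illustrated with a minimal Python sketch. It computes the two quantities exactly, by scanning every document, which is the target value that an estimation method would try to approximate without access to the documents; it is not the statistical estimation method of the paper. Several details are assumptions made for illustration only: the function names `cosine` and `usefulness`, the use of cosine similarity over term-weight dictionaries, and the reading of "sufficiently similar" as "similarity strictly greater than a threshold".

```python
import math


def cosine(query, doc):
    """Cosine similarity between two sparse term-weight dicts (assumed similarity function)."""
    dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


def usefulness(query, docs, threshold):
    """Return (number of documents above the threshold, their average similarity).

    Assumption: "sufficiently similar" means similarity > threshold.
    The average is reported as 0.0 when no document qualifies.
    """
    sims = [cosine(query, d) for d in docs]
    hits = [s for s in sims if s > threshold]
    no_doc = len(hits)
    avg_sim = sum(hits) / no_doc if no_doc else 0.0
    return no_doc, avg_sim


if __name__ == "__main__":
    # Hypothetical toy database of three documents.
    database = [{"a": 1.0}, {"b": 1.0}, {"a": 1.0, "b": 1.0}]
    print(usefulness({"a": 1.0}, database, threshold=0.5))
```

A metasearch interface could rank search engines by such a pair and forward the query only to the highest-ranked ones. In practice the pair has to be estimated from a compact summary of each database, since scanning every document for every query is exactly the cost the approach is meant to avoid.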


Index Terms:
Metasearch, information resource discovery, information retrieval.
King-Lup Liu, Clement Yu, Weiyi Meng, Wensheng Wu, Naphtali Rishe, "A Statistical Method for Estimating the Usefulness of Text Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 6, pp. 1422-1437, Nov.-Dec. 2002, doi:10.1109/TKDE.2002.1047777