A Methodology to Retrieve Text Documents from Multiple Databases
November/December 2002 (vol. 14 no. 6)
pp. 1347-1361

Abstract—This paper presents a methodology for finding the n most similar documents across multiple text databases, for any given query and any positive integer n. The methodology consists of two steps. First, the contents of each database are summarized approximately by a database representative, and the databases are ranked with respect to the given query using these representatives. We give a necessary and sufficient condition for ranking the databases optimally and, in order to satisfy this condition, provide three estimation methods: one intended for short queries and two for all queries. Second, we provide an algorithm, OptDocRetrv, that retrieves documents from the databases according to their rank and in a particular way. We show that if the databases containing the n most similar documents for a given query are ranked ahead of the other databases, our methodology guarantees the retrieval of the n most similar documents for the query. When the number of databases is large, we propose organizing the database representatives into a hierarchy and employing a best-search algorithm to search the hierarchy. We show that the effectiveness of the best-search algorithm is the same as that of evaluating the user query against all database representatives.
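
The two-step idea in the abstract (rank the databases, then retrieve in rank order) can be illustrated with a minimal Python sketch. This is not the paper's OptDocRetrv algorithm, whose details the abstract does not give; it is a simplified threshold-style stand-in. The function name `retrieve_top_n`, the `estimates` mapping, and the `search` callback are all hypothetical, and the sketch assumes that each database's estimate is an upper bound on the similarity of any document it contains.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Doc = Tuple[float, str]  # (similarity to the query, document id)


def retrieve_top_n(estimates: Dict[str, float],
                   search: Callable[[str], List[Doc]],
                   n: int) -> List[Doc]:
    """Return the n most similar documents, searching databases in rank order.

    Assumption (not from the paper): estimates[db] is an upper bound on the
    similarity of every document stored in db.
    """
    # Step 1: rank databases by their estimated best similarity, highest first.
    ranked = sorted(estimates, key=estimates.get, reverse=True)

    top: List[Doc] = []  # min-heap holding the best n documents seen so far
    for db in ranked:
        # Step 2: stop once n documents are held and the weakest of them
        # already matches or beats anything this database could contain.
        if len(top) == n and top[0][0] >= estimates[db]:
            break
        for doc in search(db):
            if len(top) < n:
                heapq.heappush(top, doc)
            elif doc > top[0]:
                heapq.heapreplace(top, doc)
    return sorted(top, reverse=True)


# Toy data: three databases with precomputed similarities (illustrative only).
databases = {
    "A": [(0.9, "a1"), (0.4, "a2")],
    "B": [(0.8, "b1"), (0.7, "b2")],
    "C": [(0.3, "c1")],
}
estimates = {"A": 0.9, "B": 0.8, "C": 0.3}

result = retrieve_top_n(estimates, lambda name: databases[name], 3)
print(result)
```

In this toy run, database C is never searched: after A and B have been searched, the three documents already held all have a similarity of at least 0.7, which exceeds C's estimate of 0.3. The early stop is only safe under the upper-bound assumption stated above.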

Index Terms:
Distributed information retrieval, resource discovery, database selection, metasearch.
Citation:
Clement Yu, King-Lup Liu, Weiyi Meng, Zonghuan Wu, Naphtali Rishe, "A Methodology to Retrieve Text Documents from Multiple Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 6, pp. 1347-1361, Nov.-Dec. 2002, doi:10.1109/TKDE.2002.1047772