Information Retrieval with Distributed Databases: Analytic Models of Performance
January 2004 (vol. 15 no. 1)
pp. 18-27

Abstract: This paper emphasizes analytical techniques for predicting the performance of various collection fusion scenarios. Knowledge of analytical models of information retrieval system performance, for both single and multiple processors, increases our understanding of the parameters (e.g., number of documents, ranking algorithms, stemming algorithms, and stop word lists) that affect system behavior. While there is a growing literature on the implementation of distributed information retrieval systems and digital libraries, little research has focused on analytic models of performance. We analytically describe performance for single and multiple processors, both when different processors have the same parameter values and when they have different values. We also examine the use of different ranking algorithms and parameter values at different sites.


Index Terms:
Collection fusion, information retrieval, metasearch engines, distributed processing, analytic performance models, digital libraries.
Citation:
Robert M. Losee, Lewis Church Jr., "Information Retrieval with Distributed Databases: Analytic Models of Performance," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 1, pp. 18-27, Jan. 2004, doi:10.1109/TPDS.2004.1264782