Optimizing the Execution of Multiple Data Analysis Queries on Parallel and Distributed Environments
June 2004 (vol. 15 no. 6)
pp. 520-532

Abstract: This paper investigates techniques for efficiently executing multiquery workloads from data- and computation-intensive applications in parallel and/or distributed computing environments. In this context, we describe a database optimization framework that supports data and computation reuse, query scheduling, and active semantic caching to speed up the evaluation of multiquery workloads. Its most striking feature is the ability to optimize the execution of queries in the presence of application-specific constructs by employing a customizable data and computation reuse model. Furthermore, we discuss how the proposed optimization model is flexible enough to work efficiently irrespective of the underlying parallel/distributed environment. To evaluate the proposed optimization techniques, we present experimental evidence using real data analysis applications. For this purpose, the queries under study were given a common implementation based on the database optimization framework and deployed on three distinct experimental configurations: a shared memory multiprocessor, a cluster of workstations, and a distributed computational Grid-like environment.

Index Terms:
Multiquery optimization, parallel databases, data analysis applications, symmetric multiprocessing, cluster computing, grid computing.
Citation:
Henrique Andrade, Tahsin Kurc, Alan Sussman, Joel Saltz, "Optimizing the Execution of Multiple Data Analysis Queries on Parallel and Distributed Environments," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 6, pp. 520-532, June 2004, doi:10.1109/TPDS.2004.11