Issue No.08 - August (2011 vol.22)

pp: 1299-1306

Bin Tang, Wichita State University, Wichita

Liqiang Wang, University of Wyoming, Laramie

Dharma Teja Nukarapu, Wichita State University, Wichita

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.207

ABSTRACT

Data replication has been widely adopted in data-intensive scientific applications to reduce data file transfer time and bandwidth consumption. However, the problem of data replication in Data Grids, an enabling technology for data-intensive applications, has proven to be NP-hard and even non-approximable, making it difficult to solve. Meanwhile, most previous research in this field is either theoretical investigation without practical consideration or heuristics-based with little or no theoretical performance guarantee. In this paper, we propose a data replication algorithm that not only has a provable theoretical performance guarantee but can also be implemented in a distributed and practical manner. Specifically, we design a polynomial-time centralized replication algorithm that reduces the total data file access delay by at least half of the reduction achieved by the optimal replication solution. Based on this centralized algorithm, we also design a distributed caching algorithm, which can be easily adopted in a distributed environment such as Data Grids. Extensive simulations are performed to validate the efficiency of our proposed algorithms. Using our own simulator, we show that our centralized replication algorithm performs comparably to the optimal algorithm and other intuitive heuristics under different network parameters. Using GridSim, a popular distributed Grid simulator, we demonstrate that the distributed caching technique significantly outperforms an existing popular file caching technique in Data Grids, and that it is more scalable and adaptive to dynamic changes in file access patterns in Data Grids.
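The abstract does not spell out how the centralized algorithm works. As a rough illustration of the kind of problem it addresses, the Python sketch below uses greedy benefit-based placement, a standard technique for placement problems of this shape. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `total_delay` and `greedy_replicate`, the unit-size files, the per-node capacity counted in files, and the delay model (access frequency times distance to the nearest copy).

```python
def total_delay(dist, freq, holders):
    """Sum over nodes i and files f of freq[i][f] * distance to nearest copy of f."""
    return sum(
        freq[i][f] * min(dist[i][h] for h in holders[f])
        for i in range(len(dist))
        for f in range(len(holders))
    )


def greedy_replicate(dist, freq, sources, capacity):
    """Hypothetical greedy placement: repeatedly add the replica with the largest delay reduction.

    dist[i][j]  : distance (delay) between nodes i and j
    freq[i][f]  : how often node i accesses file f
    sources[f]  : node that permanently stores file f (assumed not to use capacity)
    capacity[i] : number of unit-size replicas node i can hold
    """
    n = len(dist)
    holders = [{s} for s in sources]   # nodes currently holding each file
    free = list(capacity)              # remaining replica slots per node
    while True:
        base = total_delay(dist, freq, holders)
        best_gain, best = 0, None
        for node in range(n):
            if free[node] == 0:
                continue
            for f in range(len(sources)):
                if node in holders[f]:
                    continue
                # Tentatively place the replica and measure the delay reduction.
                holders[f].add(node)
                gain = base - total_delay(dist, freq, holders)
                holders[f].remove(node)
                if gain > best_gain:
                    best_gain, best = gain, (node, f)
        if best is None:               # no slot left or no placement helps
            return holders
        node, f = best
        holders[f].add(node)
        free[node] -= 1
```

For example, on a three-node line with `dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]`, one file stored at node 0, access frequencies `[[0], [1], [5]]`, and a single free slot at node 2, the sketch places the replica at node 2, where most requests originate. Recomputing the total delay on every trial keeps the code short at the cost of efficiency; an incremental benefit update would be the natural refinement.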

INDEX TERMS

Data intensive applications, Data Grids, data replication, algorithm design and analysis, simulations.

CITATION

Bin Tang, Liqiang Wang, Dharma Teja Nukarapu, "Data Replication in Data Intensive Scientific Applications with Performance Guarantee," *IEEE Transactions on Parallel & Distributed Systems*, vol. 22, no. 8, pp. 1299-1306, August 2011, doi:10.1109/TPDS.2010.207