An Online Data Access Prediction and Optimization Approach for Distributed Systems
June 2012 (vol. 23 no. 6)
pp. 1017-1029
Renato Porfirio Ishii, Federal University of Mato Grosso do Sul-UFMS, Campo Grande
Rodrigo Fernandes de Mello, University of São Paulo, São Carlos
Current scientific applications produce large amounts of data. Processing, handling, and analyzing such data require large-scale computing infrastructures such as clusters and grids. In this area, studies aim to improve the performance of data-intensive applications by optimizing data accesses. To achieve this goal, distributed storage systems have adopted techniques such as data replication, migration, distribution, and access parallelism. However, the main drawback of those studies is that they do not take application behavior into account when optimizing data access. This limitation motivated this paper, which applies strategies to support the online prediction of application behavior in order to optimize data access operations on distributed systems, without requiring any information on past executions. To accomplish this, the approach organizes application behaviors as time series and then analyzes and classifies those series according to their properties. Knowing these properties, the approach selects modeling techniques to represent the series and perform predictions, which are later used to optimize data access operations. This new approach was implemented and evaluated using the OptorSim simulator, sponsored by the LHC-CERN project and widely employed by the scientific community. Experiments confirm that this new approach reduces application execution time by about 50 percent, especially when handling large amounts of data.
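
The classify-then-predict idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' classifier or model set: the two classes ("correlated" and "noise-like"), the lag-1 autocorrelation criterion, the 0.5 threshold, and all function names are assumptions made only for illustration. The series stands for any observed access metric, such as bytes read per operation.

```python
from statistics import mean


def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation; returns 0.0 for a constant series."""
    m = mean(series)
    den = sum((x - m) ** 2 for x in series)
    if den == 0:
        return 0.0
    num = sum((series[i] - m) * (series[i - 1] - m)
              for i in range(1, len(series)))
    return num / den


def classify(series, threshold=0.5):
    """Hypothetical two-class split based on lag-1 autocorrelation."""
    r = lag1_autocorrelation(series)
    return "correlated" if abs(r) >= threshold else "noise-like"


def predict_next(series):
    """Select a model from the class, then predict the next observation."""
    m = mean(series)
    if classify(series) == "correlated":
        # AR(1)-style one-step forecast around the series mean (assumed model).
        phi = lag1_autocorrelation(series)
        return m + phi * (series[-1] - m)
    # Fallback for noise-like series: predict the mean.
    return m


# Example: access sizes observed online, with no history from past runs.
observed = [1, 2, 3, 4, 5, 6, 7, 8]
next_size = predict_next(observed)
```

A prediction like `next_size` could then inform an optimization decision, for example how much data to prefetch or which replica to place closer to the application. How such predictions are actually used is described in the paper itself, not in this sketch.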

Index Terms:
Distributed computing, distributed file system, data access optimization, time series analysis, prediction.
Renato Porfirio Ishii, Rodrigo Fernandes de Mello, "An Online Data Access Prediction and Optimization Approach for Distributed Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 6, pp. 1017-1029, June 2012, doi:10.1109/TPDS.2011.256