Issue No.07 - July (2012 vol.23)

pp: 1288-1301

Jorge E. Pezoa , Universidad de Concepción, Concepción

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TPDS.2011.285

ABSTRACT

Average service time, quality-of-service (QoS), and service reliability of heterogeneous parallel and distributed computing systems (DCSs) are analytically characterized in a realistic setting in which tangible, stochastic communication delays with nonexponential distributions are present. Departing from the traditionally assumed exponential distributions for event times, such as task-execution times, communication arrival times, and load-transfer delays, gives rise to a non-Markovian dynamical problem, for which a novel age-dependent, renewal-based distributed queuing model is developed. Numerical examples offered by the model shed light on the operational and system settings in which the Markovian setting, which results from assuming exponentially distributed event times, yields inaccurate predictions. A key benefit of the model is that it offers a rigorous framework for systematically devising optimal dynamic task reallocation (DTR) policies in heterogeneous DCSs by optimally selecting the fraction of the excess loads that need to be exchanged among the servers, thereby controlling the degree of cooperative processing in a DCS. Key results on performance prediction and optimization of DCSs are validated using Monte-Carlo (MC) simulation as well as experiments on a distributed computing testbed. The scalability of the age-dependent model in the number of servers is studied, and a linearly scalable analytical approximation is derived.
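The two ideas at the center of the abstract, nonexponential event times and the fraction of excess load exchanged between servers, can be illustrated with a small Monte-Carlo experiment. The sketch below is not the paper's age-dependent renewal model. It is a toy two-server system written under stated assumptions: one batch transfer at time zero, no server failures, and a gamma distribution standing in for "nonexponential". All function names and parameter values are hypothetical.

```python
import random


def exp_sampler(rng, mean):
    """Exponential event time with the given mean (the Markovian assumption)."""
    return rng.expovariate(1.0 / mean)


def gamma_sampler(rng, mean, shape=4.0):
    """Gamma event time with the same mean but lower variance.

    The gamma law is an assumption chosen for illustration only.
    """
    return rng.gammavariate(shape, mean / shape)


def simulate_once(rng, tasks, frac, mean_service, mean_delay, sampler):
    """One run of a toy two-server system; returns the overall completion time.

    tasks        -- initial queue lengths (server 0, server 1)
    frac         -- fraction of server 0's queue shipped to server 1 at time 0
    mean_service -- mean per-task service time of each server
    mean_delay   -- mean transfer delay of the shipped batch
    """
    moved = int(frac * tasks[0])
    # Server 0 works through whatever it kept.
    t0 = sum(sampler(rng, mean_service[0]) for _ in range(tasks[0] - moved))
    # Server 1 works through its own queue first.
    t1 = sum(sampler(rng, mean_service[1]) for _ in range(tasks[1]))
    if moved:
        # The batch arrives after a random delay; server 1 may idle until then.
        arrival = sampler(rng, mean_delay)
        t1 = max(t1, arrival) + sum(
            sampler(rng, mean_service[1]) for _ in range(moved)
        )
    return max(t0, t1)


def average_completion_time(n_runs, tasks, frac, mean_service, mean_delay,
                            sampler, seed=0):
    """Monte-Carlo estimate of the mean completion time."""
    rng = random.Random(seed)
    total = sum(
        simulate_once(rng, tasks, frac, mean_service, mean_delay, sampler)
        for _ in range(n_runs)
    )
    return total / n_runs


if __name__ == "__main__":
    # Hypothetical workload: 40 tasks on server 0, none on server 1.
    for frac in (0.0, 0.25, 0.5, 0.75):
        e = average_completion_time(2000, (40, 0), frac, (1.0, 1.0), 2.0, exp_sampler)
        g = average_completion_time(2000, (40, 0), frac, (1.0, 1.0), 2.0, gamma_sampler)
        print(f"frac={frac:.2f}  exponential={e:.2f}  gamma={g:.2f}")
```

Sweeping `frac` shows how the estimated completion time depends on the amount of load exchanged. Swapping `exp_sampler` for `gamma_sampler` at equal means gives a rough feel for how an exponential assumption can shift the prediction.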

INDEX TERMS

Renewal theory, non-Markovian processes, distributed queuing theory, reliability, distributed computing, load balancing.

CITATION

Jorge E. Pezoa, "Performance and Reliability of Non-Markovian Heterogeneous Distributed Computing Systems",

*IEEE Transactions on Parallel & Distributed Systems*, vol. 23, no. 7, pp. 1288-1301, July 2012, doi:10.1109/TPDS.2011.285