
Yawei Li, Zhiling Lan, Prashasta Gujrati, and Xian-He Sun, "Fault-Aware Runtime Strategies for High-Performance Computing," IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 4, pp. 460-473, April 2009. DOI: 10.1109/TPDS.2008.128. ISSN: 1045-9219.
Keywords: Scheduling, Fault-tolerance, Parallel systems, Performance.