2016 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW) (2016)
Ottawa, Ontario, Canada
Oct. 23, 2016 to Oct. 27, 2016
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ISSREW.2016.7
Large community clusters are becoming increasingly common in universities and other organizations because of the benefits they offer researchers in operational costs and resource availability. However, efficient administration, failure diagnosis, and performance debugging on community clusters are challenging tasks due to the sheer diversity of workloads and users. These clusters are typically shared by users from various scientific domains and experience levels. Many users have little computing experience and therefore often face performance issues, which leads to resource wastage. In this paper, we study these dynamics in one of the largest university-wide community clusters (Conte at Purdue University). We perform an in-depth analysis of library and application usage patterns, job failures, and performance issues. Further, we introduce a set of novel analysis techniques that can be used to identify hidden trends and diagnose job failures in compute clusters in general. We provide concrete recommendations for cluster administrators and present case studies highlighting how such information can be used to proactively solve many user issues, ultimately leading to better quality of service.
Libraries, Program processors, Supercomputers, Data analysis, Software reliability, Organizations
S. Mitra et al., "A Study of Failures in Community Clusters: The Case of Conte," 2016 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Ottawa, Ontario, Canada, 2016, pp. 189-196.