Issue No. 06 - Nov.-Dec. (2017 vol. 10)
ISSN: 1939-1374
pp: 984-998
Andrea Rosa, Faculty of Informatics, Università della Svizzera italiana (USI), Lugano, Switzerland
Lydia Y. Chen, Cloud Server Technologies Group, IBM Research Lab Zurich, Rüschlikon, Switzerland
Walter Binder, Faculty of Informatics, Università della Svizzera italiana (USI), Lugano, Switzerland
ABSTRACT
Motivated by the high complexity of today’s datacenters, a large body of studies tries to understand workloads and resource utilization in datacenters. However, there is little work on exploring unsuccessful job and task executions. In this article, we study and predict three types of unsuccessful executions in traces of a Google datacenter, namely $\sf{fail}$, $\sf{kill}$, and $\sf{eviction}$. We first quantitatively show their strongly negative impact on machine time and the resulting task slowdown. We analyze patterns of unsuccessful jobs and tasks, particularly focusing on their interdependencies, and we uncover their root causes by inspecting key workload and system attributes. Furthermore, we develop three on-line prediction models that can classify jobs and events into four classes upon arrival time, using independent or nested Neural Networks. We explore different combinations of feature sets and techniques to reduce the computational overhead. Our evaluation results show that the proposed models can accurately classify 94.4 percent of jobs and 76.8 percent of events into four classes.
INDEX TERMS
Time factors, Predictive models, Failure analysis, Google, Computational modeling, Analytical models, Resource management
CITATION

A. Rosa, L. Y. Chen and W. Binder, "Failure Analysis and Prediction for Big-Data Systems," in IEEE Transactions on Services Computing, vol. 10, no. 6, pp. 984-998, 2017.
doi:10.1109/TSC.2016.2543718