2013 IEEE 5th International Conference on Cloud Computing Technology and Science (2013)
Bristol, United Kingdom
Dec. 2, 2013 to Dec. 5, 2013
With cloud computing, a cycle of fault diagnosis and recovery becomes the norm. Large amounts of monitoring data and log events are available, but it is hard to determine which events or metrics are critical for fault diagnosis. Other approaches model faults as deviations from normal behavior, and are therefore less applicable in the cloud, where changes in the environment may alter what is considered normal. In this work, we propose an adaptive and flexible fault diagnosis framework that automatically identifies the key fault indicators and detects fault patterns. Leveraging ideas from social media, we represent the hierarchical relationships among metrics and events, as well as how they relate to faults. We apply the EdgeRank algorithm to determine the key events that contribute to a fault. Our approach works across different environments to detect potential faults. We evaluated our framework on a cloud-based enterprise system with a list of injected faults ranging from environmental faults (e.g., virtual machine or network) to application degradation. We considered both private and public clouds. Our solution achieves over 90% detection accuracy with modest overhead. A comparison shows that our approach is more accurate than alternative approaches in the literature.
Measurement, Fault diagnosis, Cloud computing, Correlation, Monitoring, Accuracy, Pattern matching
Q. Zhu, T. Tung and Q. Xie, "Automatic Fault Diagnosis in Cloud Infrastructure," 2013 IEEE 5th International Conference on Cloud Computing Technology and Science (CLOUDCOM), Bristol, United Kingdom, 2013, pp. 467-474.