Issue No.06 - June (2013 vol.24)
pp: 1245-1255
Haibo Mi, National University of Defense Technology, Changsha
Huaimin Wang, National University of Defense Technology, Changsha
Yangfan Zhou, The Chinese University of Hong Kong, Shatin
Michael Rung-Tsong Lyu, The Chinese University of Hong Kong, Hong Kong
Hua Cai, Alibaba Cloud Computing, Alibaba Inc., Hangzhou
ABSTRACT
Performance diagnosis is labor intensive in production cloud computing systems. Such systems face many real-world challenges that existing diagnosis techniques for distributed systems cannot effectively address, and an efficient, unsupervised diagnosis tool for locating fine-grained performance anomalies is still lacking. This paper proposes CloudDiag to bridge this gap. By combining a statistical technique with a fast matrix recovery algorithm, CloudDiag can efficiently pinpoint fine-grained causes of performance problems without requiring any domain-specific knowledge of the target system. CloudDiag has been applied in a production cloud computing system to diagnose performance problems. We demonstrate the effectiveness of CloudDiag in three real-world case studies.
INDEX TERMS
Production, Cloud computing, Electronic mail, Synchronization, Time factors, Data collection, Clocks, Request tracing, Performance diagnosis
CITATION
Haibo Mi, Huaimin Wang, Yangfan Zhou, Michael Rung-Tsong Lyu, Hua Cai, "Toward Fine-Grained, Unsupervised, Scalable Performance Diagnosis for Production Cloud Computing Systems", IEEE Transactions on Parallel & Distributed Systems, vol.24, no. 6, pp. 1245-1255, June 2013, doi:10.1109/TPDS.2013.21