CP-Miner: Finding Copy-Paste and Related Bugs in Large-Scale Software Code
March 2006 (vol. 32 no. 3)
pp. 176-192
Recent studies have shown that large software suites contain significant amounts of replicated code. It is assumed that some of this replication is due to copy-and-paste activity and that a significant proportion of bugs in operating systems are due to copy-paste errors. Existing static code analyzers are either not scalable to large software suites or do not perform robustly where replicated code is modified with insertions and deletions. Furthermore, the existing tools do not detect copy-paste related bugs. In this paper, we propose a tool, CP-Miner, that uses data mining techniques to efficiently identify copy-pasted code in large software suites and detects copy-paste bugs. Specifically, it takes less than 20 minutes for CP-Miner to identify 190,000 copy-pasted segments in Linux and 150,000 in FreeBSD. Moreover, CP-Miner has detected many new bugs in popular operating systems, 49 in Linux and 31 in FreeBSD, most of which have since been confirmed by the corresponding developers and have been rectified in the following releases. In addition, we have found some interesting characteristics of copy-paste in operating system code. Specifically, we analyze the distribution of copy-pasted code by size (number of lines of code), granularity (basic blocks and functions), and modification within copy-pasted code. We also analyze copy-paste across different modules and various software versions.

Index Terms:
Software analysis, code reuse, code duplication, debugging aids, data mining.
Citation:
Zhenmin Li, Shan Lu, Suvda Myagmar, Yuanyuan Zhou, "CP-Miner: Finding Copy-Paste and Related Bugs in Large-Scale Software Code," IEEE Transactions on Software Engineering, vol. 32, no. 3, pp. 176-192, March 2006, doi:10.1109/TSE.2006.28