CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code
July 2002 (vol. 28 no. 7)
pp. 654-670

A code clone is a portion of code in source files that is identical or similar to another portion. Because code clones are believed to reduce the maintainability of software, several clone detection techniques and tools have been proposed. This paper proposes a new clone detection technique that consists of a transformation of the input source text and a token-by-token comparison. To implement it, together with several optimization techniques, we developed a tool named CCFinder, which extracts code clones from C, C++, Java, COBOL, and other source files. We also developed metrics for code clones. To evaluate the usefulness of CCFinder and the metrics, we conducted several case studies in which we applied the tool to the source code of JDK, FreeBSD, NetBSD, Linux, and many other systems. In these studies, CCFinder found clones effectively, and the metrics effectively identified the characteristics of the systems. In addition, we compared the proposed technique with other clone detection techniques.
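
As a rough illustration of the two steps named in the abstract (transforming the source text, then comparing token by token), the sketch below may help. It is not CCFinder's implementation: the lexer, the keyword subset, the `$id`/`$num` placeholders, and the function names `normalize` and `find_clone_pairs` are all assumptions made for this example, and the quadratic comparison stands in for whatever optimized matching the actual tool uses.

```python
import re

# Hypothetical, simplified lexer: identifiers, integer literals,
# or any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

# Illustrative keyword subset only; a real tool would use a per-language list.
KEYWORDS = {"if", "else", "for", "while", "return", "int"}


def normalize(source):
    """Transform source text into tokens, replacing names and numbers by placeholders."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("$id")   # assumed placeholder for any identifier
        elif tok[0].isdigit():
            tokens.append("$num")  # assumed placeholder for any integer literal
        else:
            tokens.append(tok)
    return tokens


def find_clone_pairs(a, b, min_len):
    """Return (start_a, start_b, length) for maximal equal token runs of at least min_len."""
    pairs = []
    prev = [0] * (len(b) + 1)  # prev[j + 1]: length of the run ending at a[i - 1], b[j]
    for i in range(len(a)):
        cur = [0] * (len(b) + 1)
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            run = prev[j] + 1
            cur[j + 1] = run
            # The run is maximal if it cannot be extended by one more token.
            at_end = i + 1 == len(a) or j + 1 == len(b)
            if (at_end or a[i + 1] != b[j + 1]) and run >= min_len:
                pairs.append((i + 1 - run, j + 1 - run, run))
        prev = cur
    return pairs
```

Under these assumptions, two fragments that differ only in variable names and constants normalize to the same token sequence, so `find_clone_pairs(normalize(x), normalize(y), min_len)` reports them as a clone pair once the shared run reaches `min_len` tokens.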

Index Terms:
Code clone, duplicated code, CASE tool, metrics, maintenance.
Citation:
Toshihiro Kamiya, Shinji Kusumoto, Katsuro Inoue, "CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code," IEEE Transactions on Software Engineering, vol. 28, no. 7, pp. 654-670, July 2002, doi:10.1109/TSE.2002.1019480