Change Distilling: Tree Differencing for Fine-Grained Source Code Change Extraction
November 2007 (vol. 33 no. 11)
pp. 725-743
A key issue in software evolution analysis is identifying the particular changes that occur across several versions of a program. We present change distilling, a tree differencing algorithm for fine-grained source code change extraction. It improves the existing algorithm of Chawathe et al. for extracting changes in hierarchically structured data. Our algorithm detects changes by finding both a match between the nodes of the two compared abstract syntax trees and a minimum edit script. We can then identify change types between program versions according to our taxonomy of source code changes. We evaluated our change distilling algorithm on a benchmark we developed, which consists of 1,064 manually classified changes in 219 revisions from three different open source projects. We achieved significant improvements in extracting types of source code changes: our algorithm approximates the minimum edit script 45% better than the original change extraction approach by Chawathe et al. We are able to find all occurring changes and almost reach the minimum conforming edit script: we reach a mean absolute percentage error of 34%, compared to 79% for the original algorithm. The paper describes both the change distilling algorithm and the results of our evaluation.
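
The two steps named in the abstract, matching nodes and then deriving edit operations from the match, can be illustrated with a deliberately simplified sketch. This is not the authors' algorithm. It matches only leaf nodes, pairs them greedily, ignores moves and inner nodes, and makes no attempt to minimize the edit script. The `Node` class, the character-bigram Dice similarity, and the 0.6 threshold are all assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node: a kind label plus its source text."""
    label: str
    value: str
    children: list[Node] = field(default_factory=list)


def bigrams(text: str) -> set[str]:
    """Set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigrams, in [0.0, 1.0]."""
    if a == b:
        return 1.0
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def leaves(node: Node) -> list[Node]:
    """Leaf nodes of the tree in left-to-right order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def leaf_changes(old: Node, new: Node, threshold: float = 0.6):
    """Greedily match leaves, then report (op, old_value, new_value) tuples.

    Leaves match when their labels are equal and their values are at least
    `threshold` similar (an assumed cutoff). Matched leaves whose values
    differ become updates; unmatched leaves become deletes or inserts.
    """
    old_leaves, new_leaves = leaves(old), leaves(new)
    used: set[int] = set()
    updates, deletes = [], []
    for o in old_leaves:
        best_idx, best_sim = None, 0.0
        for idx, n in enumerate(new_leaves):
            if idx in used or n.label != o.label:
                continue
            sim = dice_similarity(o.value, n.value)
            if sim >= threshold and sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is None:
            deletes.append(("delete", o.value, None))
            continue
        used.add(best_idx)
        if o.value != new_leaves[best_idx].value:
            updates.append(("update", o.value, new_leaves[best_idx].value))
    inserts = [("insert", None, n.value)
               for idx, n in enumerate(new_leaves) if idx not in used]
    return updates + deletes + inserts


old_tree = Node("METHOD", "foo", [
    Node("STATEMENT", "int x = 1;"),
    Node("STATEMENT", "return x;"),
])
new_tree = Node("METHOD", "foo", [
    Node("STATEMENT", "int x = 2;"),
    Node("STATEMENT", "log(x);"),
    Node("STATEMENT", "return x;"),
])

for change in leaf_changes(old_tree, new_tree):
    print(change)
```

On this toy input the sketch reports one update (the changed initializer) and one insert (the new `log(x);` statement). A full tree differencing approach would also match inner nodes and detect moved subtrees.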

[1] G.W. Adamson and J. Boreham, “The Use of an Association Measure Based on Character Structure to Identify Semantically Related Pairs of Words and Document Titles,” Information Storage and Retrieval, vol. 10, nos. 7-8, pp. 253-260, July-Aug. 1974.
[2] T. Apiwattanapong, A. Orso, and M.J. Harrold, “JDiff: A Differencing Technique and Tool for Object-Oriented Programs,” Automated Software Eng., vol. 14, no. 1, pp. 3-36, Mar. 2007.
[3] U. Asklund, “Identifying Conflicts during Structural Merge,” Proc. Nordic Workshop Programming Environment Research, pp. 231-242, June 1994.
[4] I.D. Baxter, A. Yahin, L.M. de Moura, M. Sant'Anna, and L. Bier, “Clone Detection Using Abstract Syntax Trees,” Proc. Int'l Conf. Software Maintenance, pp. 368-377, Nov. 1998.
[5] A. Bernstein, C. Kiefer, and E. Kaufmann, “SimPack: A Generic Java Library for Similarity Measures in Ontologies,” technical report, Dept. of Informatics, Univ. of Zürich, Switzerland, 2005.
[6] P. Bille, “A Survey on Tree Edit Distance and Related Problems,” Theoretical Computer Science, vol. 337, nos. 1-3, pp. 217-239, June 2005.
[7] G. Canfora, L. Cerulo, and M.D. Penta, “Identifying Changed Source Code Lines from Version Repositories,” Proc. Int'l Workshop Mining Software Repositories, p. 14, May 2007.
[8] S.S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom, “Change Detection in Hierarchically Structured Information,” Proc. ACM Sigmod Int'l Conf. Management of Data, pp. 493-504, June 1996.
[9] D. Čubranić, G.C. Murphy, J. Singer, and K.S. Booth, “Hipikat: A Project Memory for Software Development,” IEEE Trans. Software Eng., vol. 31, no. 6, pp. 446-465, June 2005.
[10] S. Demeyer, S. Tichelaar, and P. Steyaert, Famix—The Famoos Information Exchange Model, 1999.
[11] L.R. Dice, “Measures of the Amount of Ecologic Association between Species,” ESA Ecology, no. 26, pp. 297-302, July 1945.
[12] B. Fluri and H.C. Gall, “Classifying Change Types for Qualifying Change Couplings,” Proc. Int'l Conf. Program Comprehension, pp. 35-45, June 2006.
[13] H. Gall, K. Hayek, and M. Jazayeri, “Detection of Logical Coupling Based on Product Release History,” Proc. Int'l Conf. Software Maintenance, pp. 190-198, Nov. 1998.
[14] M.W. Godfrey and L. Zou, “Using Origin Analysis to Detect Merging and Splitting of Source Code Entities,” IEEE Trans. Software Eng., vol. 31, no. 2, pp. 166-181, Feb. 2005.
[15] A.E. Hassan and R.C. Holt, “Studying the Evolution of Software Systems Using Evolutionary Code Extractors,” Proc. Int'l Workshop Principles of Software Evolution, pp. 76-81, Sept. 2004.
[16] S. Horwitz, “Identifying the Semantic and Textual Differences between Two Versions of a Program,” Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation, pp. 234-245, June 1990.
[17] J.J. Hunt, “Extensible, Language-Aware Differencing and Merging,” PhD dissertation, Univ. of Karlsruhe, Germany, 2001.
[18] J.W. Hunt and T.G. Szymanski, “A Fast Algorithm for Computing Longest Common Subsequences,” Comm. ACM, vol. 20, no. 5, pp. 350-353, May 1977.
[19] P. Jaccard, “The Distribution of the Flora in the Alpine Zone,” New Phytologist, vol. 11, no. 2, pp. 37-50, Feb. 1912.
[20] D. Jackson and D.A. Ladd, “Semantic Diff: A Tool for Summarizing the Effects of Modifications,” Proc. Int'l Conf. Software Maintenance, pp. 243-252, Sept. 1994.
[21] U. Kelter, J. Wehren, and J. Niere, “A Generic Difference Algorithm for UML Models,” Software Eng., pp. 105-116, Mar. 2005.
[22] M. Kim and D. Notkin, “Program Element Matching for Multi-Version Program Analyses,” Proc. Int'l Workshop Mining Software Repositories, pp. 58-64, May 2006.
[23] M. Kim, D. Notkin, and D. Grossman, “Automatic Inference of Structural Changes for Matching Across Program Versions,” Proc. Int'l Conf. Software Eng., pp. 333-343, May 2007.
[24] S. Kim, K. Pan, and J.E. Whitehead, “When Functions Change Their Names: Automatic Detection of Origin Relationships,” Proc. Working Conf. Reverse Eng., pp. 143-152, Nov. 2005.
[25] M.M. Lehman, “Programs, Life Cycles and Laws of Software Evolution,” Proc. IEEE, pp. 1060-1076, Sept. 1980.
[26] V.I. Levenshtein, “Binary Codes Capable of Correcting Deletions, Insertions and Reversals,” Soviet Physics Doklady, vol. 10, pp. 707-710, Feb. 1966.
[27] J.I. Maletic and M.L. Collard, “Supporting Source Code Difference Analysis,” Proc. Int'l Conf. Software Maintenance, pp. 210-219, Sept. 2004.
[28] T. Mens, “A State-of-the-Art Survey on Software Merging,” IEEE Trans. Software Eng., vol. 28, no. 5, pp. 449-462, May 2002.
[29] R. Purushothaman and D.E. Perry, “Toward Understanding the Rhetoric of Small Source Code Changes,” IEEE Trans. Software Eng., vol. 31, no. 6, pp. 511-526, June 2005.
[30] S. Raghavan, R. Rohana, D. Leon, A. Podgurski, and V. Augustine, “Dex: A Semantic-Graph Differencing Tool for Studying Changes in Large Code Base,” Proc. Int'l Conf. Software Maintenance, pp. 188-197, Sept. 2004.
[31] T. Sager, A. Bernstein, M. Pinzger, and C. Kiefer, “Detecting Similar Java Classes Using Tree Algorithms,” Proc. Int'l Workshop Mining Software Repositories, pp. 65-71, May 2006.
[32] D. Shasha and K. Zhang, “Fast Algorithms for the Unit Cost Editing Distance between Trees,” J. Algorithms, vol. 11, no. 4, pp. 581-621, Dec. 1990.
[33] J. Śliwerski, T. Zimmermann, and A. Zeller, “Hatari: Raising Risk Awareness,” Proc. European Software Eng. Conf. and ACM SIGSOFT Symp. Foundations of Software Eng., pp. 107-110, Sept. 2005.
[34] J. Śliwerski, T. Zimmermann, and A. Zeller, “When Do Changes Induce Fixes?” Proc. Int'l Workshop Mining Software Repositories, pp. 24-28, May 2005.
[35] K.-C. Tai, “The Tree-to-Tree Correction Problem,” J. ACM, vol. 26, no. 3, pp. 422-433, July 1979.
[36] Q. Tu and M.W. Godfrey, “An Integrated Approach for Studying Architectural Evolution,” Proc. Int'l Workshop Program Comprehension, pp. 127-136, June 2002.
[37] G. Valiente, Algorithms on Trees and Graphs. Springer, 2002.
[38] J. Weidl and H.C. Gall, “Binding Object Models to Source Code: An Approach to Object-Oriented Re-Architecting,” Proc. Computer Software and Applications Conf., pp. 26-31, Aug. 1998.
[39] B. Westfechtel, “Structure-Oriented Merging of Revisions of Software Documents,” Proc. Int'l Workshop Software Configuration Management, pp. 68-79, June 1991.
[40] M. Würsch, “Improving ChangeDistiller—Improving Abstract Syntax Tree Based Source Code Change Detection,” master's thesis, Univ. of Zürich, 2006.
[41] Z. Xing and E. Stroulia, “Analyzing the Evolutionary History of the Logical Design of Object-Oriented Software,” IEEE Trans. Software Eng., vol. 31, no. 10, pp. 850-868, Oct. 2005.
[42] Z. Xing and E. Stroulia, “UMLDiff: An Algorithm for Object-Oriented Design Differencing,” Proc. Int'l Conf. Automated Software Eng., pp. 54-65, Nov. 2005.
[43] W. Yang, “Identifying Syntactic Differences between Two Programs,” J. Software—Practice and Experience, vol. 21, no. 7, pp. 739-755, July 1991.
[44] W. Yang, “How to Merge Program Texts,” J. Systems and Software, vol. 27, no. 2, pp. 129-135, Nov. 1994.
[45] A.T. Ying, G.C. Murphy, R. Ng, and M.C. Chu-Carroll, “Predicting Source Code Changes by Mining Change History,” IEEE Trans. Software Eng., vol. 30, no. 9, pp. 574-586, Sept. 2004.
[46] K. Zhang, “Algorithms for the Constrained Editing Distance between Ordered Labeled Trees and Related Problems,” Pattern Recognition, vol. 28, no. 3, pp. 463-474, June 1995.
[47] T. Zimmermann, P. Weissgerber, S. Diehl, and A. Zeller, “Mining Version Histories to Guide Software Changes,” IEEE Trans. Software Eng., vol. 31, no. 6, pp. 429-445, June 2005.

Index Terms:
Source code change extraction, tree differencing algorithms, software repositories, software evolution analysis
Beat Fluri, Michael Wuersch, Martin Pinzger, Harald Gall, "Change Distilling: Tree Differencing for Fine-Grained Source Code Change Extraction," IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 725-743, Nov. 2007, doi:10.1109/TSE.2007.70731