Improving Source Code Lexicon via Traceability and Information Retrieval
March/April 2011 (vol. 37 no. 2)
pp. 205-227
Andrea De Lucia, University of Salerno, Fisciano
Massimiliano Di Penta, University of Sannio, Benevento
Rocco Oliveto, University of Molise, Pesche
This paper presents an approach that helps developers keep source code identifiers and comments consistent with high-level artifacts. Specifically, the approach computes and displays the textual similarity between source code and the related high-level artifacts. Our conjecture is that developers are induced to improve the source code lexicon, i.e., the terms used in identifiers and comments, if the software development environment shows them the textual similarity between the source code under development and the related high-level artifacts. The approach also recommends candidate identifiers built from the high-level artifacts related to the source code under development. It has been implemented as an Eclipse plug-in called COde Comprehension Nurturant Using Traceability (COCONUT). The paper also reports on two controlled experiments performed with master's and bachelor's students. The goal of the experiments is to evaluate the quality of identifiers and comments, in terms of their consistency with high-level artifacts, in source code produced with and without COCONUT. The results confirm our conjecture that showing developers the similarity between code and high-level artifacts helps improve the quality of the source code lexicon. This indicates the potential usefulness of COCONUT as a feature for software development environments.
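
To make the notion of textual similarity between code and a high-level artifact concrete, the following is a minimal sketch in Python. It is not COCONUT's implementation: the abstract does not state which similarity measure the plug-in uses, so plain term-frequency cosine similarity is assumed here, and the function names (`split_terms`, `cosine_similarity`) and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def split_terms(text):
    """Return lowercase terms, splitting camelCase and snake_case identifiers."""
    # Insert a space at lower-to-upper boundaries: "getUserName" -> "get User Name".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Keep alphabetic runs only; underscores, digits, and punctuation act as separators.
    return [term.lower() for term in re.findall(r"[A-Za-z]+", spaced)]


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts (assumed measure)."""
    freq_a = Counter(split_terms(text_a))
    freq_b = Counter(split_terms(text_b))
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical high-level artifact and two hypothetical code fragments.
use_case = "The customer books a room and pays the invoice"
consistent_code = "class RoomBooking { void payInvoice(Customer customer) {} }"
inconsistent_code = "class Mgr { void proc(Obj o) {} }"

print(cosine_similarity(use_case, consistent_code))
print(cosine_similarity(use_case, inconsistent_code))
```

The fragment whose identifiers reuse terms from the use case scores higher than the one with opaque names, which is the kind of feedback the abstract describes. The sketch omits stemming and stop-word removal, so "books" and "booking" are treated as unrelated terms.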

Index Terms:
Software traceability, source code comprehensibility, source code identifier quality, information retrieval, software development environments, empirical software engineering.
Citation:
Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, "Improving Source Code Lexicon via Traceability and Information Retrieval," IEEE Transactions on Software Engineering, vol. 37, no. 2, pp. 205-227, March-April 2011, doi:10.1109/TSE.2010.89