Recovering Traceability Links between Code and Documentation
October 2002 (vol. 28 no. 10)
pp. 970-983

Abstract—Software system documentation is almost always expressed informally in natural language and free text. Examples include requirement specifications, design documents, manual pages, system development journals, error logs, and related maintenance reports. We propose a method based on information retrieval to recover traceability links between source code and free text documents. A premise of our work is that programmers use meaningful names for program items, such as functions, variables, types, classes, and methods. We believe that the application-domain knowledge that programmers process when writing the code is often captured by the mnemonics for identifiers; therefore, the analysis of these mnemonics can help to associate high-level concepts with program concepts and vice-versa. We apply both a probabilistic and a vector space information retrieval model in two case studies to trace C++ source code onto manual pages and Java code to functional requirements. We compare the results of applying the two models, discuss the benefits and limitations, and describe directions for improvements.
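The vector space approach mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that identifiers are split on camelCase and underscore boundaries, that a code component acts as the query, and that documents are ranked by cosine similarity over smoothed tf-idf weights. It omits the text normalization a real pipeline would need (stop-word removal, stemming). All names and the example data (`split_identifier`, `rank_documents`, `EXAMPLE_CODE`, `EXAMPLE_DOCS`) are hypothetical.

```python
import math
import re
from collections import Counter


def split_identifier(name):
    """Split a camelCase identifier into lowercase words (assumed convention)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", name)
    return [p.lower() for p in parts]


def tokens(text):
    """Extract alphabetic runs from code or prose and split each into words."""
    result = []
    for word in re.findall(r"[A-Za-z]+", text):
        result.extend(split_identifier(word))
    return result


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(code, documents):
    """Rank documents (name -> text) against one code component, best first."""
    names = list(documents)
    doc_tokens = [tokens(documents[name]) for name in names]
    n_docs = len(names)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in doc_tokens for t in set(toks))

    def weigh(toks):
        # Smoothed tf-idf; terms unseen in any document get df = 0.
        tf = Counter(toks)
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                for t, c in tf.items()}

    query = weigh(tokens(code))
    scores = [(name, cosine(query, weigh(toks)))
              for name, toks in zip(names, doc_tokens)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical example data, invented for illustration only.
EXAMPLE_CODE = """
class AccountManager {
    double getAccountBalance(int accountId);
    void closeAccount(int accountId);
}
"""
EXAMPLE_DOCS = {
    "req_balance": "The user can view the balance of an account.",
    "req_report": "The system prints a monthly sales report.",
}

for name, score in rank_documents(EXAMPLE_CODE, EXAMPLE_DOCS):
    print(f"{name}: {score:.3f}")
```

In this toy example the requirement that shares the words "account" and "balance" with the class ranks first, while the unrelated one scores zero. A candidate traceability link would then be proposed for the top-ranked documents, for instance those above a chosen similarity threshold.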

Index Terms:
Redocumentation, traceability, program comprehension, object orientation, information retrieval.
Giuliano Antoniol, Gerardo Canfora, Gerardo Casazza, Andrea De Lucia, Ettore Merlo, "Recovering Traceability Links between Code and Documentation," IEEE Transactions on Software Engineering, vol. 28, no. 10, pp. 970-983, Oct. 2002, doi:10.1109/TSE.2002.1041053