Information-Theoretic Software Clustering
February 2005 (vol. 31 no. 2)
pp. 150-165
Periklis Andritsos, IEEE Computer Society
Vassilios Tzerpos, IEEE Computer Society
The majority of the algorithms in the software clustering literature utilize structural information to decompose large software systems. Approaches using other attributes, such as file names or ownership information, have also demonstrated merit. At the same time, existing algorithms commonly deem all attributes of the software artifacts being clustered as equally important, a rather simplistic assumption. Moreover, no method that can assess the usefulness of a particular attribute for clustering purposes has been presented in the literature. In this paper, we present an approach that applies information-theoretic techniques in the context of software clustering. Our approach allows weighting schemes that reflect the importance of various attributes to be applied. We introduce LIMBO, a scalable hierarchical clustering algorithm based on the minimization of information loss when clustering a software system. We also present a method that can assess the usefulness of any nonstructural attribute in a software clustering context. We applied LIMBO to three large software systems in a number of experiments. The results indicate that this approach produces clusterings that come close to decompositions prepared by system experts. Experimental results were also used to validate our usefulness assessment method. Finally, we experimented with well-established weighting schemes from information retrieval, web search, and data clustering. We report which weighting schemes show merit in the decomposition of software systems.
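
To make "minimization of information loss" concrete, the following is a minimal sketch of the merge cost used in the agglomerative information bottleneck family of methods, on which this style of clustering is commonly built. It is not taken from the paper: the function names, the representation of a cluster as a (weight, attribute distribution) pair, and the toy distributions are assumptions made for illustration. Under that formulation, merging two clusters loses an amount of mutual information equal to their combined weight times the Jensen-Shannon divergence of their attribute distributions, so a greedy hierarchical algorithm merges the pair with the smallest such loss first.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing by convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_information_loss(weight_a, dist_a, weight_b, dist_b):
    """Information lost by merging two clusters (hypothetical helper).

    weight_*: probability mass of each cluster, e.g. the fraction of
              software artifacts it contains (an assumption of this sketch).
    dist_*:   conditional distribution over attribute values for the cluster.

    Returns (loss, merged_distribution), where
    loss = (weight_a + weight_b) * JS divergence of the two distributions.
    """
    total = weight_a + weight_b
    pi_a, pi_b = weight_a / total, weight_b / total
    # The merged cluster's distribution is the weighted mixture of the two.
    merged = [pi_a * a + pi_b * b for a, b in zip(dist_a, dist_b)]
    js = pi_a * kl_divergence(dist_a, merged) + pi_b * kl_divergence(dist_b, merged)
    return total * js, merged


if __name__ == "__main__":
    # Two artifacts that use exactly the same attributes merge at no cost.
    same_loss, _ = merge_information_loss(0.25, [0.5, 0.5], 0.25, [0.5, 0.5])
    # Two artifacts with no attributes in common are the most costly to merge.
    disjoint_loss, _ = merge_information_loss(0.25, [1.0, 0.0], 0.25, [0.0, 1.0])
    print(same_loss, disjoint_loss)
```

In this sketch, artifacts that share the same attribute profile can be merged without losing information, while artifacts with disjoint profiles incur the largest loss. An attribute weighting scheme would enter by changing the distributions passed in, which is again an assumption about how one might wire it up rather than a description of the authors' implementation.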


Index Terms:
Reverse engineering, reengineering, architecture reconstruction, clustering, information theory.
Citation:
Periklis Andritsos, Vassilios Tzerpos, "Information-Theoretic Software Clustering," IEEE Transactions on Software Engineering, vol. 31, no. 2, pp. 150-165, Feb. 2005, doi:10.1109/TSE.2005.25