Using Mutation Analysis for Assessing and Comparing Testing Coverage Criteria
August 2006 (vol. 32 no. 8)
pp. 608-624
The empirical assessment of test techniques plays an important role in software testing research. One common practice is to seed faults in subject software, either manually or by using a program that generates all possible mutants based on a set of mutation operators. The latter allows the systematic, repeatable seeding of large numbers of faults, thus facilitating the statistical analysis of fault detection effectiveness of test suites; however, we do not know whether empirical results obtained this way lead to valid, representative conclusions. Focusing on four common control and data flow criteria (Block, Decision, C-Use, and P-Use), this paper investigates this important issue based on a medium-sized industrial program with a comprehensive pool of test cases and known faults. Based on the data available thus far, the results are very consistent across the investigated criteria and show that the use of mutation operators yields trustworthy results: generated mutants can be used to predict the detection effectiveness of real faults. Applying such a mutation analysis, we then investigate the relative cost and effectiveness of the above-mentioned criteria by revisiting fundamental questions regarding the relationships between fault detection, test suite size, and control/data flow coverage. Although such questions have been partially investigated in previous studies, we can use a large number of mutants, which helps decrease the impact of random variation in our analysis and allows us to use a different analysis approach. Our results are then compared with published studies, plausible reasons for the differences are provided, and the research leads us to suggest a way to tune the mutation analysis process to possible differences in fault detection probabilities in a specific environment.
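
As a rough illustration of the kind of mutation analysis the abstract describes, the sketch below seeds faults with a single relational-operator-replacement operator and measures what fraction of the resulting mutants a test suite detects. Everything in it is an assumption made for illustration: the toy `max2` function, the four mutants, and the use of the original program as the oracle are hypothetical and are not the subject program, operators, or tooling used in the paper.

```python
import operator

def make_max(cmp):
    """Build a two-argument 'max' whose comparison is the callable `cmp`."""
    def max2(a, b):
        return a if cmp(a, b) else b
    return max2

# Hypothetical original program: returns the larger of two values.
original = make_max(operator.gt)

# Mutants produced by replacing the relational operator '>' (illustrative set).
mutants = {
    "gt->lt": make_max(operator.lt),
    "gt->ge": make_max(operator.ge),  # equivalent mutant: no input distinguishes it
    "gt->eq": make_max(operator.eq),
    "gt->ne": make_max(operator.ne),
}

def killed(mutant, tests):
    """A mutant is killed if some test input makes it disagree with the original."""
    return any(mutant(*args) != original(*args) for args in tests)

def mutation_score(tests):
    """Fraction of all generated mutants that the test suite kills."""
    return sum(killed(m, tests) for m in mutants.values()) / len(mutants)

small_suite = [(1, 2)]
larger_suite = [(1, 2), (2, 1)]
print(mutation_score(small_suite))   # 0.5
print(mutation_score(larger_suite))  # 0.75
```

In this toy setting, adding one test case raises the score from 0.5 to 0.75, which mirrors on a very small scale the relationship between test suite size and fault detection that the paper examines. The score cannot reach 1.0 here because the `gt->ge` mutant behaves identically to the original on every input.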

Index Terms:
Testing and debugging, testing strategies, test coverage of code, experimental design.
James H. Andrews, Lionel C. Briand, Yvan Labiche, Akbar Siami Namin, "Using Mutation Analysis for Assessing and Comparing Testing Coverage Criteria," IEEE Transactions on Software Engineering, vol. 32, no. 8, pp. 608-624, Aug. 2006, doi:10.1109/TSE.2006.83