Issue No.08 - Aug. (2013 vol.39)
pp: 1054-1068
Fayola Peters, West Virginia University, Morgantown
Tim Menzies, West Virginia University, Morgantown
Liang Gong, Tsinghua University, Beijing
Hongyu Zhang, Tsinghua University, Beijing
ABSTRACT
Background: Cross-company defect prediction (CCDP) is a field of study where an organization lacking enough local data can use data from other organizations for building defect predictors. To support CCDP, data must be shared. Such shared data must be privatized, but that privatization could severely damage the utility of the data.
Aim: To enable effective defect prediction from shared data while preserving privacy.
Method: We explore privatization algorithms that maintain class boundaries in a dataset. CLIFF is an instance pruner that deletes irrelevant examples. MORPH is a data mutator that moves the data a random distance, taking care not to cross class boundaries. CLIFF+MORPH are tested in a CCDP study among 10 defect datasets from the PROMISE data repository.
Results: We find: 1) The CLIFFed+MORPHed algorithms provide more privacy than the state-of-the-art privacy algorithms; 2) in terms of utility measured by defect prediction, we find that CLIFF+MORPH performs significantly better.
Conclusions: For the OO defect data studied here, data can be privatized and shared without a significant degradation in utility. To the best of our knowledge, this is the first published result where privatization does not compromise defect prediction.
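The mutation step described under Method can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes one plausible reading of "moves the data a random distance, taking care not to cross class boundaries": each row is shifted by a random fraction of its distance to the nearest row of a different class. The function names, the fraction bounds `lo` and `hi`, and the toy data are hypothetical choices made for illustration; the abstract does not specify them. The CLIFF pruning step is not shown.

```python
import math
import random


def nearest_unlike_neighbor(i, rows, labels):
    """Return the row closest to rows[i] (Euclidean) whose label differs, or None."""
    best, best_dist = None, math.inf
    for j, other in enumerate(rows):
        if labels[j] == labels[i]:
            continue  # same class: not an "unlike" neighbor
        d = math.dist(rows[i], other)
        if d < best_dist:
            best, best_dist = other, d
    return best


def morph_like(rows, labels, lo=0.15, hi=0.35, seed=0):
    """Shift every row by a random fraction of its distance to the nearest
    unlike neighbor. lo and hi are assumed bounds (not taken from the abstract);
    keeping hi below 0.5 leaves each shifted row closer to its original
    position than to that unlike neighbor."""
    rng = random.Random(seed)
    mutated = []
    for i, x in enumerate(rows):
        z = nearest_unlike_neighbor(i, rows, labels)
        if z is None:  # only one class present: nothing to measure against
            mutated.append(list(x))
            continue
        r = rng.uniform(lo, hi)          # random fraction of the distance
        sign = rng.choice((-1.0, 1.0))   # move toward or away from z
        mutated.append([xi + sign * r * (xi - zi) for xi, zi in zip(x, z)])
    return mutated


# Toy data: two numeric attributes per row, labels 0 (clean) and 1 (defective).
rows = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
labels = [0, 0, 1, 1]
private_rows = morph_like(rows, labels)
```

In practice the attributes would typically be normalized to a common scale before distances are computed, so that no single metric dominates the neighbor search. Class labels are left unchanged; only the attribute values are perturbed.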
INDEX TERMS
Testing, Software, Genetic algorithms, Sociology, Statistics, Search problems, Arrays, defect prediction, Privacy, classification
CITATION
Fayola Peters, Tim Menzies, Liang Gong, Hongyu Zhang, "Balancing Privacy and Utility in Cross-Company Defect Prediction", IEEE Transactions on Software Engineering, vol.39, no. 8, pp. 1054-1068, Aug. 2013, doi:10.1109/TSE.2013.6