Cost-Constrained Data Acquisition for Intelligent Data Preparation
November 2005 (vol. 17 no. 11)
pp. 1542-1556
Xingquan Zhu and Xindong Wu, IEEE
Real-world data is often noisy, suffering from corrupted or incomplete values that degrade the models built from it. To construct accurate predictive models, data acquisition is usually adopted to prepare the data and complete missing values. However, because acquisition is costly and the data set contains inherent correlations, acquiring correct information for all instances is prohibitive and unnecessary. An interesting and important problem then arises: which instances should be completed so that the model built from the processed data receives the maximum performance improvement? The problem is complicated by the reality that attribute costs differ, and fixing the missing values of some attributes is inherently more expensive than of others. The problem therefore becomes: given a fixed budget, which instances should be selected for preparation so that the learner built from the processed data set maximizes its performance? In this paper, we propose a solution to this problem. The essential idea is to combine attribute costs with each attribute's relevance to the target concept, so that data acquisition pays more attention to attributes that are cheap to acquire but informative for classification. To this end, we first introduce a unique Economical Factor (EF) that seamlessly integrates the cost and the importance (in terms of classification) of each attribute. We then propose a cost-constrained data acquisition model in which active learning, missing value prediction, and impact-sensitive instance ranking are combined for effective data acquisition. Experimental results and comparative studies on real-world data sets demonstrate the effectiveness of our method.
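The abstract does not reproduce the paper's exact EF formula, so the sketch below is only an illustrative assumption: it scores each attribute by information gain per unit acquisition cost, in the spirit of the gain/cost heuristics of Nunez [33] and Tan [34]. The names `economical_factor`, `info_gain`, and the toy cost vector are hypothetical, not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Information gain of splitting on one discrete attribute."""
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

def economical_factor(rows, labels, costs):
    """Hypothetical EF: reward informative attributes, penalize costly ones."""
    return [info_gain(rows, labels, i) / costs[i] for i in range(len(costs))]

# Toy data: attribute 0 perfectly predicts the class, attribute 1 is noise,
# but attribute 0 is twice as expensive to acquire.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
costs = [2.0, 1.0]
ef = economical_factor(rows, labels, costs)
# attr 0: gain 1.0 / cost 2.0 = 0.5;  attr 1: gain 0.0 / cost 1.0 = 0.0
```

Under a fixed budget, acquisition would then prioritize filling missing values of attributes with the highest EF scores; the paper combines such a score with active learning and impact-sensitive instance ranking to decide *which instances* to complete.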

[1] M. Berry and G. Linoff, Mastering Data Mining. Wiley, 1999.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[3] D. Pyle, Data Preparation for Data Mining. Morgan Kaufmann, 1999.
[4] T. Redman, Data Quality for the Information Age. Artech House, 1996.
[5] J. Quinlan, “Unknown Attribute Values in Induction,” Proc. Sixth Int'l Conf. Machine Learning Workshop, pp. 164-168, 1989.
[6] R. Greiner, A. Grove, and A. Kogan, “Knowing What Doesn't Matter: Exploiting the Omission of Irrelevant Data,” Artificial Intelligence, vol. 97, nos. 1-2, Dec. 1997.
[7] D. Schuurmans and R. Greiner, “Learning to Classify Incomplete Examples,” Computational Learning Theory and Natural Learning Systems: Making Learning Systems Practical. MIT Press, 1996.
[8] N. Friedman, “Learning Belief Networks in the Presence of Missing Values and Hidden Variables,” Proc. Int'l Conf. Machine Learning, pp. 125-133, 1997.
[9] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Wadsworth & Brooks, 1984.
[10] A. Shapiro, Structured Induction in Expert Systems. Addison-Wesley, 1987.
[11] R. Little and D. Rubin, Statistical Analysis with Missing Data. New York: Wiley, 1987.
[12] P. Clark and T. Niblett, “The CN2 Induction Algorithm,” Machine Learning, vol. 3, no. 4, pp. 261-283, 1989.
[13] I. Kononenko, I. Bratko, and E. Roskar, “Experiments in Automatic Learning of Medical Diagnostic Rules,” technical report, Jozef Stefan Inst., Ljubljana, Yugoslavia, 1984.
[14] S. Tseng, K. Wang, and C. Lee, “A Pre-Processing Method to Deal with Missing Values by Integrating Clustering and Regression Techniques,” Applied Artificial Intelligence, vol. 17, nos. 5-6, pp. 535-544, 2003.
[15] J. Quinlan, C4.5: Programs for Machine Learning. San Mateo, Calif.: Morgan Kaufmann, 1993.
[16] J. Quinlan, “Induction of Decision Trees,” Machine Learning, vol. 1, pp. 81-106, 1986.
[17] D. Wilson, “Asymptotic Properties of Nearest Neighbor Rules Using Edited Data,” IEEE Trans. Systems, Man, and Cybernetics, vol. 2, pp. 408-421, 1972.
[18] D. Aha, D. Kibler, and M. Albert, “Instance-Based Learning Algorithms,” Machine Learning, vol. 6, no. 1, pp. 37-66, 1991.
[19] D. Wilson and T. Martinez, “Reduction Techniques for Instance-Based Learning Algorithms,” Machine Learning, vol. 38, no. 3, pp. 257-268, 2000.
[20] P. Winston, “Learning Structural Descriptions from Examples,” The Psychology of Computer Vision, New York: McGraw-Hill, 1975.
[21] F. Provost, D. Jensen, and T. Oates, “Efficient Progressive Sampling,” Proc. Fifth ACM SIGKDD, pp. 23-32, 1999.
[22] D. Lewis and J. Catlett, “Heterogeneous Uncertainty Sampling for Supervised Learning,” Proc. 11th Int'l Conf. Machine Learning, pp. 148-156, 1994.
[23] D. Cohn, L. Atlas, and R. Ladner, “Improving Generalization with Active Learning,” Machine Learning, vol. 15, pp. 201-221, 1994.
[24] H. Seung, M. Opper, and H. Sompolinsky, “Query by Committee,” Proc. ACM Workshop Computational Learning Theory, 1992.
[25] D. Mackay, “Information-Based Objective Functions for Active Data Selection,” Neural Computation, vol. 4, no. 4, pp. 590-604, 1992.
[26] D. Lewis and W. Gale, “A Sequential Algorithm for Training Text Classifiers,” Proc. Int'l SIG-IR Conf. Research and Development in Information Retrieval, pp. 3-12, 1994.
[27] Z. Zheng and B. Padmanabhan, “On Active Learning for Data Acquisition,” Proc. IEEE Conf. Data Mining, pp. 562-569, 2002.
[28] X. Zhu, X. Wu, and Y. Yang, “Error Detection and Impact-Sensitive Instance Ranking in Noisy Data Set,” Proc. 19th Nat'l Conf. Artificial Intelligence (AAAI), 2004.
[29] X. Zhu and X. Wu, “Data Acquisition with Active and Impact-Sensitive Instance Selection,” Proc. IEEE Int'l Conf. Tools with Artificial Intelligence (ICTAI), 2004.
[30] D. Lizotte, O. Madani, and R. Greiner, “Budgeted Learning of Naive-Bayes Classifiers,” Proc. Uncertainty in Artificial Intelligence, 2003.
[31] P. Turney, “Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm,” J. Artificial Intelligence Research, vol. 2, pp. 369-409, 1995.
[32] C. Blake and C. Merz, UCI Repository of Machine Learning Databases, 1998.
[33] M. Nunez, “The Use of Background Knowledge in Decision Tree Induction,” Machine Learning, vol. 6, pp. 231-250, 1991.
[34] M. Tan, “Cost-Sensitive Learning of Classification Knowledge and Applications in Robotics,” Machine Learning, vol. 13, 1993.
[35] P. Hoel, “Likelihood Ratio Tests,” Introduction to Mathematical Statistics, third ed. New York: Wiley, 1962.
[36] C. Shannon and W. Weaver, The Mathematical Theory of Communication. Univ. of Illinois Press, 1971.
[37] A. Freitas, “Understanding the Crucial Role of Attribute Interaction in Data Mining,” Artificial Intelligence Rev., vol. 16, no. 3, pp. 177-199, 2001.
[38] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.

Index Terms: Data mining, intelligent data preparation, data acquisition, cost-sensitive, machine learning, instance ranking.
Xingquan Zhu, Xindong Wu, "Cost-Constrained Data Acquisition for Intelligent Data Preparation," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1542-1556, Nov. 2005, doi:10.1109/TKDE.2005.176