Issue No.01 - January (2011 vol.23)

pp: 110-121

Xiaofeng Zhu, University of Technology Sydney, Sydney, Australia

Shichao Zhang, Zhejiang Normal University, Jinhua, China

Zhi Jin, Beijing University, Beijing, China

Zili Zhang, Southwest University, Chongqing, China

Zhuoming Xu, Hohai University, Nanjing, China

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.99

ABSTRACT

Missing data imputation is a key issue in learning from incomplete data. Various techniques have been developed with great success for handling missing values in data sets with homogeneous attributes (their independent attributes are either all continuous or all discrete). This paper studies a new setting of missing data imputation: imputing missing data in data sets with heterogeneous attributes (their independent attributes are of different types), referred to as imputing mixed-attribute data sets. Although many real applications fall into this setting, no estimator has been designed for imputing mixed-attribute data sets. This paper first proposes two consistent estimators, one for discrete and one for continuous missing target values. It then advocates a mixture-kernel-based iterative estimator for imputing mixed-attribute data sets. The proposed method is evaluated in extensive experiments against several typical algorithms, and the results demonstrate that it outperforms these existing imputation methods in terms of classification accuracy and root mean square error (RMSE) at different missing ratios.
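To make the setting concrete, the following is a minimal Python sketch of kernel-based iterative imputation of a continuous target when the independent attributes are mixed. It is not the paper's estimator: the abstract does not give the form of the mixture kernel, so a simple product of a Gaussian kernel (continuous attributes) and a match/mismatch kernel (discrete attributes) is assumed instead. The function names, the bandwidth `h`, and the smoothing parameter `lam` are all hypothetical choices made for illustration.

```python
import numpy as np


def mixed_kernel_weights(xc, xd, Xc, Xd, h=0.5, lam=0.01):
    """Weights between one query row and every row of the data set.

    Assumed product kernel (not the paper's mixture kernel):
    Gaussian on continuous attributes, and a factor of `lam`
    for every discrete attribute that does not match.
    """
    cont = np.exp(-0.5 * np.sum(((Xc - xc) / h) ** 2, axis=1))
    mismatches = np.sum(Xd != xd, axis=1)
    return cont * lam ** mismatches


def impute_continuous_target(Xc, Xd, y, h=0.5, lam=0.01, n_iter=20, tol=1e-8):
    """Iteratively fill NaN entries of y with a kernel-weighted average."""
    y = np.asarray(y, dtype=float).copy()
    missing = np.isnan(y)
    idx = np.flatnonzero(missing)
    if idx.size == 0:
        return y
    # Initial guess: mean of the observed target values.
    y[missing] = y[~missing].mean()
    for _ in range(n_iter):
        previous = y[idx].copy()
        for i in idx:
            w = mixed_kernel_weights(Xc[i], Xd[i], Xc, Xd, h, lam)
            w[i] = 0.0  # leave the row being imputed out of its own estimate
            if w.sum() > 0:
                y[i] = np.dot(w, y) / w.sum()
        # Stop once the imputed values no longer change noticeably.
        if np.max(np.abs(y[idx] - previous)) < tol:
            break
    return y


def rmse(true, pred):
    """Root mean square error between two equally sized arrays."""
    diff = np.asarray(true, dtype=float) - np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy data (made up): the target depends on both attribute types.
Xc = np.array([[0.0], [0.1], [0.2], [0.3], [0.0], [0.1], [0.2], [0.3]])
Xd = np.array([[0], [0], [0], [0], [1], [1], [1], [1]])
y_true = 10.0 * Xd[:, 0] + Xc[:, 0]

y_obs = y_true.copy()
y_obs[[1, 6]] = np.nan  # hide two target values
y_hat = impute_continuous_target(Xc, Xd, y_obs)

hidden = np.isnan(y_obs)
print("kernel RMSE:", rmse(y_true[hidden], y_hat[hidden]))
print("mean-fill RMSE:", rmse(y_true[hidden], np.full(hidden.sum(), np.nanmean(y_obs))))
```

A discrete target could be handled in the same loop by replacing the weighted average with a kernel-weighted vote over class labels, and scored by classification accuracy rather than RMSE.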

INDEX TERMS

Classification, data mining, methodologies, machine learning.

CITATION

Xiaofeng Zhu, Shichao Zhang, Zhi Jin, Zili Zhang, Zhuoming Xu, "Missing Value Estimation for Mixed-Attribute Data Sets",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 23, no. 1, pp. 110-121, January 2011, doi:10.1109/TKDE.2010.99