On the Use of Conceptual Reconstruction for Mining Massively Incomplete Data Sets
November/December 2003 (vol. 15 no. 6)
pp. 1512-1521
Srinivasan Parthasarathy, IEEE Computer Society
Charu C. Aggarwal

Abstract: Incomplete data sets have become almost ubiquitous in a wide variety of application domains. Common examples can be found in climate and image data sets, sensor data sets, and medical data sets. The incompleteness in these data sets may arise from a number of factors: in some cases, certain measurements were simply not available at the time; in others, information was lost due to partial system failure; in still others, users were unwilling to specify attributes because of privacy concerns. When a significant fraction of the entries is missing in all of the attributes, it becomes very difficult to perform any kind of reasonable extrapolation on the original data. For such cases, we introduce the novel idea of conceptual reconstruction, in which we create effective conceptual representations on which data mining algorithms can be directly applied. The attraction of conceptual reconstruction is that it uses the correlation structure of the data to express it in terms of concepts rather than the original dimensions. As a result, the reconstruction procedure estimates only those conceptual aspects of the data that can be mined from the incomplete data set, rather than forcing errors created by extrapolation. We demonstrate the effectiveness of the approach on a variety of real data sets.
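
The abstract gives no algorithmic detail, so the following is only a minimal sketch of one way the idea could be realized, not the authors' procedure. It assumes missing entries are marked as NaN, estimates the correlation structure from the entries that are jointly observed for each attribute pair, and expresses every record in terms of the k leading eigenvectors ("concepts") using only its observed attributes. The function names `pairwise_covariance` and `conceptual_coordinates`, the NaN convention, and the choice of an eigendecomposition are all assumptions made for illustration.

```python
import numpy as np


def pairwise_covariance(X):
    """Estimate a covariance matrix from incomplete data (NaN = missing).

    Each entry (i, j) is averaged over the rows in which attributes i and j
    are both observed. Assumes every attribute has at least one observed value.
    """
    observed = ~np.isnan(X)
    means = np.nanmean(X, axis=0)
    # Center the observed entries; missing entries contribute zero.
    centered = np.where(observed, X - means, 0.0)
    # counts[i, j] = number of rows where attributes i and j are both observed.
    obs = observed.astype(float)
    counts = obs.T @ obs
    cov = (centered.T @ centered) / np.maximum(counts, 1.0)
    return cov, means


def conceptual_coordinates(X, k):
    """Project each incomplete record onto the k leading eigenvectors.

    Only observed attributes contribute to a record's coordinates, so no
    missing value is explicitly filled in on the original dimensions.
    """
    cov, means = pairwise_covariance(X)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    concepts = eigvecs[:, order]
    centered = np.where(np.isnan(X), 0.0, X - means)
    return centered @ concepts


if __name__ == "__main__":
    # Synthetic example: 6 correlated attributes driven by 2 hidden factors,
    # with roughly 30% of all entries removed at random.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(200, 2))
    X = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(200, 6))
    X[rng.random(X.shape) < 0.3] = np.nan
    Z = conceptual_coordinates(X, k=2)
    print(Z.shape)
```

The resulting low-dimensional array contains no missing values, so a standard clustering or classification routine could be run on it directly. Note that a covariance matrix estimated pairwise in this way is not guaranteed to be positive semidefinite; the sketch simply keeps the eigenvectors with the largest eigenvalues.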


Index Terms:
Incomplete data, missing values, data mining.
Srinivasan Parthasarathy, Charu C. Aggarwal, "On the Use of Conceptual Reconstruction for Mining Massively Incomplete Data Sets," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 6, pp. 1512-1521, Nov.-Dec. 2003, doi:10.1109/TKDE.2003.1245289