An Integrated Data Preparation Scheme for Neural Network Data Analysis
February 2006 (vol. 18 no. 2)
pp. 217-230
Data preparation is a critical step in neural network modeling for complex data analysis, and it strongly affects the success of a wide variety of complex data analysis tasks, such as data mining and knowledge discovery. Although its importance is widely recognized, the existing literature on data preparation for neural network data analysis is scattered, and no systematic study of the topic exists. In this study, we first propose an integrated data preparation scheme as a systematic study for neural network data analysis. Within the integrated scheme, a survey of data preparation, focusing on problems with the data and the corresponding processing techniques, is provided. Intelligent data preparation solutions to several important issues and dilemmas arising within the integrated scheme are then discussed in detail. Subsequently, a cost-benefit analysis framework for the integrated scheme is presented to analyze the effect of data preparation on complex data analysis. Finally, a typical example of complex data analysis from the financial domain illustrates the application of data preparation techniques and demonstrates the impact of data preparation on complex data analysis.
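The kind of preparation pipeline surveyed in the paper typically combines missing-value imputation, outlier treatment, and normalization before network training. The sketch below is a minimal, hypothetical illustration of those three standard steps (mean imputation, 3-sigma clipping, min-max scaling to [0, 1]); it is not the authors' implementation, and the function name and choices of thresholds are assumptions for illustration only.

```python
import numpy as np

def prepare(data):
    """Illustrative data preparation pipeline (hypothetical helper):
    mean imputation, 3-sigma outlier clipping, min-max scaling."""
    data = np.asarray(data, dtype=float).copy()
    # 1. Missing-value imputation: replace NaNs with the column mean.
    col_mean = np.nanmean(data, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(data))
    data[nan_rows, nan_cols] = col_mean[nan_cols]
    # 2. Outlier treatment: clip values beyond 3 standard deviations.
    mu, sigma = data.mean(axis=0), data.std(axis=0)
    data = np.clip(data, mu - 3 * sigma, mu + 3 * sigma)
    # 3. Normalization: min-max scale each column to [0, 1], a common
    #    requirement for networks with sigmoid activations.
    lo, hi = data.min(axis=0), data.max(axis=0)
    return (data - lo) / np.where(hi > lo, hi - lo, 1.0)
```

The ordering matters: imputing before computing clipping bounds keeps NaNs from poisoning the statistics, and scaling last ensures the network sees inputs in a bounded range.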

[1] X. Hu, “DB-H Reduction: A Data Preprocessing Algorithm for Data Mining Applications,” Applied Math. Letters, vol. 16, pp. 889-895, 2003.
[2] K.U. Sattler and E. Schallehn, “A Data Preparation Framework Based on a Multidatabase Language,” Proc. Int'l Symp. Database Eng. & Applications, pp. 219-228, 2001.
[3] M. Lou, “Preprocessing Data for Neural Networks,” Technical Analysis of Stocks & Commodities Magazine, Oct. 1993.
[4] D. Pyle, Data Preparation for Data Mining. Morgan Kaufmann, 1999.
[5] M.W. Gardner and S.R. Dorling, “Artificial Neural Networks (the Multilayer Perceptron)— A Review of Applications in the Atmospheric Sciences,” Atmospheric Environment, vol. 32, pp. 2627-2636, 1998.
[6] M.Y. Rafiq, G. Bugmann, and D.J. Easterbrook, “Neural Network Design for Engineering Applications,” Computers & Structures, vol. 79, pp. 1541-1552, 2001.
[7] K.A. Krycha and U. Wagner, “Applications of Artificial Neural Networks in Management Science: A Survey,” J. Retailing and Consumer Services, vol. 6, pp. 185-203, 1999.
[8] K.J. Hunt, D. Sbarbaro, R. Zbikowski, and P.J. Gawthrop, “Neural Networks for Control Systems— A Survey,” Automatica, vol. 28, pp. 1083-1112, 1992.
[9] D.E. Rumelhart, “The Basic Ideas in Neural Networks,” Comm. ACM, vol. 37, pp. 87-92, 1994.
[10] K.S. Narendra and K. Parthasarathy, “Identification and Control of Dynamic Systems Using Neural Networks,” IEEE Trans. Neural Networks, vol. 1, pp. 4-27, 1990.
[11] M.R. Azimi-Sadjadi and S.A. Stricker, “Detection and Classification of Buried Dielectric Anomalies Using Neural Networks— Further Results,” IEEE Trans. Instrumentations and Measurement, vol. 43, pp. 34-39, 1994.
[12] A. Beltratti, S. Margarita, and P. Terna, Neural Networks for Economic and Financial Modeling. London: Int'l Thomson Publishing Inc., 1996.
[13] Y. Senol and M.P. Gough, “The Application of Transputers to a Sounding Rocket Instrumentation: On-Board Autocorrelators with Neural Network Data Analysis,” Parallel Computing and Transputer Applications, pp. 798-806, 1992.
[14] E.J. Gately, Neural Networks for Financial Forecasting. New York: John Wiley & Sons, Inc., 1996.
[15] A.N. Refenes, Y. Abu-Mostafa, J. Moody, and A. Weigend, Neural Networks in Financial Engineering. World Scientific Publishing Company, 1996.
[16] K.A. Smith and J.N.D. Gupta, Neural Networks in Business: Techniques and Applications. Hershey, Pa.: Idea Group Publishing, 2002.
[17] G.P. Zhang, Neural Networks in Business Forecasting. IRM Press, 2004.
[18] B.D. Klein and D.F. Rossin, “Data Quality in Neural Network Models: Effect of Error Rate and Magnitude of Error on Predictive Accuracy,” OMEGA, The Int'l J. Management Science, vol. 27, pp. 569-582, 1999.
[19] T.C. Redman, Data Quality: Management and Technology. New York: Bantam Books, 1992.
[20] T.C. Redman, Data Quality for the Information Age. Norwood, Mass.: Artech House, Inc., 1996.
[21] S. Zhang, C. Zhang, and Q. Yang, “Data Preparation for Data Mining,” Applied Artificial Intelligence, vol. 17, pp. 375-381, 2003.
[22] A. Famili, W. Shen, R. Weber, and E. Simoudis, “Data Preprocessing and Intelligent Data Analysis,” Intelligent Data Analysis, vol. 1, pp. 3-23, 1997.
[23] R. Stein, “Selecting Data for Neural Networks,” AI Expert, vol. 8, no. 2, pp. 42-47, 1993.
[24] R. Stein, “Preprocessing Data for Neural Networks,” AI Expert, vol. 8, no. 3, pp. 32-37, 1993.
[25] A.D. McAulay and J. Li, “Wavelet Data Compression for Neural Network Preprocessing,” Signal Processing, Sensor Fusion, and Target Recognition, vol. 1699, pp. 356-365, SPIE, 1992.
[26] V. Nedeljkovic and M. Milosavljevic, “On the Influence of the Training Set Data Preprocessing on Neural Networks Training,” Proc. 11th IAPR Int'l Conf. Pattern Recognition, pp. 1041-1045, 1992.
[27] J. Sjoberg, “Regularization as a Substitute for Preprocessing of Data in Neural Network Training,” Artificial Intelligence in Real-Time Control, pp. 31-35, 1992.
[28] O.E. De Noord, “The Influence of Data Preprocessing on the Robustness and Parsimony of Multivariate Calibration Models,” Chemometrics and Intelligent Laboratory Systems, vol. 23, pp. 65-70, 1994.
[29] J. DeWitt, “Adaptive Filtering Network for Associative Memory Data Preprocessing,” Proc. World Congress Neural Networks, vol. IV, pp. 34-38, 1994.
[30] D. Joo, D. Choi, and H. Park, “The Effects of Data Preprocessing in the Determination of Coagulant Dosing Rate,” Water Research, vol. 34, pp. 3295-3302, 2000.
[31] H.H. Nguyen and C.W. Chan, “A Comparison of Data Preprocessing Strategies for Neural Network Modeling of Oil Production Prediction,” Proc. Third IEEE Int'l Conf. Cognitive Informatics, 2004.
[32] J. Pickett, The American Heritage Dictionary, fourth ed. Boston: Houghton Mifflin, 2000.
[33] P. Ingwersen, Information Retrieval Interaction. London: Taylor Graham, 1992.
[34] U.Y. Nahm, “Text Mining with Information Extraction: Mining Prediction Rules from Unstructured Text,” PhD thesis, 2001.
[35] F. Lemke and J.A. Muller, “Self-Organizing Data Mining,” Systems Analysis Modelling Simulation, vol. 43, pp. 231-240, 2003.
[36] E. Tuv and G. Runger, “Preprocessing of High-Dimensional Categorical Predictors in Classification Setting,” Applied Artificial Intelligence, vol. 17, pp. 419-429, 2003.
[37] C.W.J. Granger, “Investigating Causal Relations by Econometric Models and Cross-Spectral Methods,” Econometrica, vol. 37, pp. 424-438, 1969.
[38] K.I. Diamantaras and S.Y. Kung, Principal Component Neural Networks: Theory and Applications. John Wiley and Sons, Inc., 1996.
[39] D.W. Ashley and A. Allegrucci, “A Spreadsheet Method for Interactive Stepwise Multiple Regression,” Proceedings, pp. 594-596, Western Decision Sciences Inst., 1999.
[40] X. Yan, C. Zhang, and S. Zhang, “Toward Databases Mining: Preprocessing Collected Data,” Applied Artificial Intelligence, vol. 17, pp. 545-561, 2003.
[41] S. Chaudhuri and U. Dayal, “An Overview of Data Warehousing and OLAP Technology,” SIGMOD Record, vol. 26, pp. 65-74, 1997.
[42] S. Abiteboul, S. Cluet, T. Milo, P. Mogilevsky, J. Simeon, and S. Zohar, “Tools for Translation and Integration,” IEEE Data Eng. Bull., vol. 22, pp. 3-8, 1999.
[43] A. Baumgarten, “Probabilistic Solution to the Selection and Fusion Problem in Distributed Information Retrieval,” Proc. SIGIR'99, pp. 246-253, 1999.
[44] Y. Li, C. Zhang, and S. Zhang, “Cooperative Strategy for Web Data Mining and Cleaning,” Applied Artificial Intelligence, vol. 17, pp. 443-460, 2003.
[45] J.H. Holland, “Genetic Algorithms,” Scientific Am., vol. 267, pp. 66-72, 1992.
[46] D.E. Goldberg, Genetic Algorithm in Search, Optimization, and Machine Learning. Reading, Mass.: Addison-Wesley, 1989.
[47] A.M. Kupinski and M.L. Giger, “Feature Selection with Limited Datasets,” Medical Physics, vol. 26, pp. 2176-2182, 1999.
[48] Mani Bloedorn and E. Bloedorn, “Multidocument Summarization by Graph Search and Matching,” Proc. 15th Nat'l Conf. Artificial Intelligence, pp. 622-628, 1997.
[49] M. Saravanan, P.C. Reghu Raj, and S. Raman, “Summarization and Categorization of Text Data in High-Level Data Cleaning for Information Retrieval,” Applied Artificial Intelligence, vol. 17, pp. 461-474, 2003.
[50] W.A. Shewhart, Economic Control of Quality of Manufactured Product. New York: D. Van Nostrand, 1931.
[51] D.A. Dickey and W.A. Fuller, “Distribution of the Estimators for Autoregressive Time Series with a Unit Root,” J. Am. Statistical Assoc., vol. 74, pp. 427-431, 1979.
[52] J. Wang, C. Zhang, X. Wu, H. Qi, and J. Wang, “SVM-OD: A New SVM Algorithm for Outlier Detection,” Proc. ICDM'03 Workshop Foundations and New Directions of Data Mining, pp. 203-209, 2003.
[53] J. Han and Y. Fu, “Dynamic Generation and Refinement of Concept Hierarchies for Knowledge Discovery in Database,” Proc. AAAI '94 Workshop Knowledge Discovery in Database, pp. 157-168, 1994.
[54] U. Fayyad and K. Irani, “Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning,” Proc. 13th Int'l Joint Conf. Artificial Intelligence, pp. 1022-1027, 1993.
[55] A. Srinivasan, S. Muggleton, and M. Bain, “Distinguishing Exceptions from Noise in Nonmonotonic Learning,” Proc. Second Int'l Workshop Inductive Logic Programming, 1992.
[56] G.H. John, “Robust Decision Trees: Removing Outliers from Data,” Proc. First Int'l Conf. Knowledge Discovery and Data Mining, pp. 174-179, 1995.
[57] D. Gamberger, N. Lavrac, and S. Dzeroski, “Noise Detection and Elimination in Data Preprocessing: Experiments in Medical Domains,” Applied Artificial Intelligence, vol. 14, pp. 205-223, 2000.
[58] G.E. Batista and M.C. Monard, “Experimental Comparison of K-Nearest Neighbor and Mean or Mode Imputation Methods with the Internal Strategies Used by C4.5 and CN2 to Treat Missing Data,” Technical Report 186, ICMC USP, 2003.
[59] G.E. Batista and M.C. Monard, “An Analysis of Four Missing Data Treatment Methods for Supervised Learning,” Applied Artificial Intelligence, vol. 17, pp. 519-533, 2003.
[60] R.J.A. Little and D.B. Rubin, Statistical Analysis with Missing Data. New York: John Wiley and Sons, 1987.
[61] A. Ragel and B. Cremilleux, “Treatment of Missing Values for Association Rules,” Proc. Second Pacific-Asia Conf. Knowledge Discovery and Data Mining, pp. 258-270, 1998.
[62] R.C.T. Lee, J.R. Slagle, and C.T. Mong, “Application of Clustering to Estimate Missing Data and Improve Data Integrity,” Proc. Int'l Conf. Software Eng., pp. 539-544, 1976.
[63] S.M. Tseng, K.H. Wang, and C.I. Lee, “A Preprocessing Method to Deal with Missing Values by Integrating Clustering and Regression Techniques,” Applied Artificial Intelligence, vol. 17, pp. 535-544, 2003.
[64] A.S. Weigend and N.A. Gershenfeld, Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, 1994.
[65] F.M. Tseng, H.C. Yu, and G.H. Tzeng, “Combining Neural Network Model with Seasonal Time Series ARIMA Model,” Technological Forecasting and Social Change, vol. 69, pp. 71-87, 2002.
[66] J. Moody, “Economic Forecasting: Challenges and Neural Network Solution,” Proc. Int'l Symp. Artificial Neural Networks, 1995.
[67] J.T. Yao and C.L. Tan, “A Case Study on Using Neural Networks to Perform Technical Forecasting of Forex,” Neurocomputing, vol. 34, pp. 79-98, 2000.
[68] K. Hornik, M. Stinchcombe, and H. White, “Multilayer Feedforward Networks Are Universal Approximators,” Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.
[69] A. Esposito, M. Marinaro, D. Oricchio, and S. Scarpetta, “Approximation of Continuous and Discontinuous Mappings by a Growing Neural RBF Based Algorithm,” Neural Networks, vol. 13, pp. 651-665, 2000.
[70] H. Martens and T. Naes, Multivariate Calibration. New York: John Wiley & Sons Inc., 1989.
[71] R. Rojas, Neural Networks: A Systematic Introduction. Berlin: Springer-Verlag, 1996.
[72] S. Geman, E. Bienenstock, and R. Doursat, “Neural Networks and the Bias/Variance Dilemma,” Neural Computation, vol. 4, pp. 1-58, 1992.
[73] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. Menlo Park, Calif.: AAAI Press, 1996.

Index Terms:
Data preparation, neural networks, complex data analysis, cost-benefit analysis.
Lean Yu, Shouyang Wang, K.K. Lai, "An Integrated Data Preparation Scheme for Neural Network Data Analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 2, pp. 217-230, Feb. 2006, doi:10.1109/TKDE.2006.22