Issue No. 1, January 2012 (vol. 24), pp. 59-71
Dengyao Mo , University of Cincinnati, Cincinnati
Samuel H. Huang , University of Cincinnati, Cincinnati
Dimensionality reduction is an important step in knowledge discovery in databases. Intrinsic dimension indicates the number of variables necessary to describe a data set. Two methods, box-counting dimension and correlation dimension, are commonly used for intrinsic dimension estimation. However, the robustness of these two methods has not been rigorously studied. This paper demonstrates that correlation dimension is more robust with respect to data sample size. In addition, instead of using a user-selected distance d, we propose a new approach that captures all log-log pairs of a data set to estimate the correlation dimension more precisely. Systematic experiments are conducted to study factors that influence the computation of correlation dimension, including sample size, the number of redundant variables, and the portion of the log-log plot used for calculation. Experiments on real-world data sets confirm the effectiveness of intrinsic dimension estimation with our improved method. Furthermore, a new supervised dimensionality reduction method based on intrinsic dimension estimation is introduced and validated.
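The correlation dimension discussed in the abstract is the slope of the log-log plot of the correlation integral C(r), the fraction of point pairs lying within distance r of each other (the Grassberger-Procaccia estimator). The following is a minimal NumPy sketch of that idea; the function name, the percentile-based choice of radii, and the number of radii are illustrative assumptions, not the paper's exact improved algorithm.

```python
import numpy as np

def correlation_dimension(X, n_radii=20):
    """Estimate the correlation dimension of a point cloud X (n x p)
    from the slope of the log-log plot of the correlation integral."""
    # All pairwise Euclidean distances, upper triangle only (i < j),
    # using the identity ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0.0, out=d2)  # guard against tiny negative round-off
    dists = np.sqrt(d2[np.triu_indices(len(X), k=1)])
    # Evaluate C(r) over the mid-range of distances, where the scaling
    # C(r) ~ r^D holds (small r is noisy; large r saturates at C = 1).
    # The 1st/50th percentile cut-offs here are an illustrative choice.
    radii = np.logspace(np.log10(np.percentile(dists, 1)),
                        np.log10(np.percentile(dists, 50)), n_radii)
    # Correlation integral C(r): fraction of pairs within distance r.
    C = np.array([(dists < r).mean() for r in radii])
    # Slope of log C(r) versus log r estimates the correlation dimension.
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)
    return slope

# Sanity check: points on a unit circle embedded in 3-D have intrinsic
# dimension 1, regardless of the embedding (ambient) dimension of 3.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, 1000)
X = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
print(correlation_dimension(X))
```

Restricting the fit to a middle portion of the log-log plot matters in practice, which is exactly the sensitivity the paper's experiments examine: the slope is biased at small radii (too few pairs) and flattens at large radii as C(r) approaches 1.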
Intrinsic dimension, fractal dimension, feature selection, knowledge discovery in databases.
Dengyao Mo, Samuel H. Huang, "Fractal-Based Intrinsic Dimension Estimation and Its Application in Dimensionality Reduction", IEEE Transactions on Knowledge & Data Engineering, vol.24, no. 1, pp. 59-71, January 2012, doi:10.1109/TKDE.2010.225