Information Loss of the Mahalanobis Distance in High Dimensions: Application to Feature Selection
December 2009 (vol. 31 no. 12)
pp. 2275-2281
Dimitrios Ververidis, Aristotle University of Thessaloniki, Thessaloniki
Constantine Kotropoulos, Aristotle University of Thessaloniki, Thessaloniki
When an infinite training set is used, the Mahalanobis distance between a pattern measurement vector of dimensionality $D$ and the center of the class it belongs to follows a $\chi^2$ distribution with $D$ degrees of freedom. With a finite training set, however, the distribution of the Mahalanobis distance becomes either Fisher or Beta, depending on whether cross validation or resubstitution is used for parameter estimation. The total variation between $\chi^2$ and Fisher, as well as between $\chi^2$ and Beta, allows us to measure the information loss in high dimensions. The information loss is then exploited to set a lower limit for the correct classification rate achieved by the Bayes classifier that is used in subset feature selection.
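The comparison described in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' implementation. It assumes Gaussian classes, a training set of N samples, and the unbiased sample covariance; under those assumptions, standard results for Hotelling's T^2 give the squared Mahalanobis distance as (N^2 - 1) D / (N (N - D)) times an F(D, N - D) variable under cross validation, and as (N - 1)^2 / N times a Beta(D/2, (N - D - 1)/2) variable under resubstitution. These scaling constants, the function names, and the integration settings are assumptions of this sketch and may differ from the normalization used in the paper. The total variation is computed as one minus the overlap of the two densities.

```python
import math


def chi2_logpdf(x, D):
    """Log-density of the chi-square distribution with D degrees of freedom."""
    k = D / 2.0
    return (k - 1.0) * math.log(x) - x / 2.0 - k * math.log(2.0) - math.lgamma(k)


def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def crossval_logpdf(x, D, N):
    """Log-density of c * F with F ~ F(D, N - D) (assumed cross-validation case)."""
    d1, d2 = float(D), float(N - D)
    c = (N * N - 1.0) * D / (N * (N - D))  # assumed scaling constant
    f = x / c
    log_f = (0.5 * d1 * math.log(d1 / d2)
             + (0.5 * d1 - 1.0) * math.log(f)
             - 0.5 * (d1 + d2) * math.log1p(d1 * f / d2)
             - log_beta_fn(0.5 * d1, 0.5 * d2))
    return log_f - math.log(c)


def resub_logpdf(x, D, N):
    """Log-density of s * B with B ~ Beta(D/2, (N-D-1)/2) (assumed resubstitution case)."""
    s = (N - 1.0) ** 2 / N  # assumed scaling constant; the support is [0, s]
    b = x / s
    if b >= 1.0:
        return -math.inf
    a1, a2 = D / 2.0, (N - D - 1) / 2.0
    return ((a1 - 1.0) * math.log(b) + (a2 - 1.0) * math.log1p(-b)
            - log_beta_fn(a1, a2) - math.log(s))


def information_loss(D, N, scheme="crossval", steps=20000):
    """Total variation between chi-square(D) and the finite-sample distribution.

    Uses TV = 1 - integral of min(p, q), evaluated with the midpoint rule.
    The integrand decays at least as fast as the chi-square tail, so
    truncating the integral at `upper` loses a negligible amount of mass.
    """
    if N <= D + 1:
        raise ValueError("need N > D + 1 for both distributions to exist")
    logq = crossval_logpdf if scheme == "crossval" else resub_logpdf
    upper = D + 15.0 * math.sqrt(2.0 * D) + 30.0
    h = upper / steps
    overlap = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        p = math.exp(chi2_logpdf(x, D))
        q = math.exp(logq(x, D, N))
        overlap += min(p, q)
    return max(0.0, 1.0 - overlap * h)


if __name__ == "__main__":
    # Hypothetical settings chosen only to show the trend with D and N.
    for D, N in [(2, 100), (20, 100), (20, 5000)]:
        print(D, N,
              round(information_loss(D, N, "crossval"), 4),
              round(information_loss(D, N, "resub"), 4))
```

In this sketch the total variation grows as D approaches N and shrinks toward zero as N grows with D fixed, which matches the abstract's point that the $\chi^2$ model loses information in high dimensions when the training set is finite.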

Index Terms:
Bayes classifier, Gaussian distribution, Mahalanobis distance, feature selection, cross validation.
Dimitrios Ververidis, Constantine Kotropoulos, "Information Loss of the Mahalanobis Distance in High Dimensions: Application to Feature Selection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2275-2281, Dec. 2009, doi:10.1109/TPAMI.2009.84