Feature Extraction Using Information-Theoretic Learning
September 2006 (vol. 28 no. 9)
pp. 1385-1392
A classification system typically consists of both a feature extractor (preprocessor) and a classifier. These two components can be trained either independently or simultaneously. Independent training has an implementation advantage, since the extractor need only be trained once for use with any classifier, whereas simultaneous training has the advantage that it can be used to minimize classification error directly. Certain criteria, such as Minimum Classification Error, are better suited to simultaneous training, whereas other criteria, such as Mutual Information, are amenable to training the feature extractor either independently or simultaneously. Herein, an information-theoretic criterion is introduced and evaluated for training the extractor independently of the classifier. The proposed method uses nonparametric estimation of Renyi's entropy to train the extractor by maximizing an approximation of the mutual information between the class labels and the output of the feature extractor. The evaluations show that the proposed method, even though it uses independent training, performs at least as well as three feature extraction methods that train the extractor and classifier simultaneously.
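
To make the criterion in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes one-dimensional features, a Gaussian Parzen window with a hand-picked width `sigma`, Renyi's quadratic (order-2) entropy, and the decomposition I(Y;C) ≈ H(Y) − Σ_c P(c) H(Y|c) with the Renyi estimate substituted for each entropy term. The function names and the toy data are hypothetical and chosen only for illustration; the abstract does not specify these details.

```python
import math

def renyi_quadratic_entropy(samples, sigma):
    """Parzen-window estimate of Renyi's quadratic entropy (1-D samples).

    Assumption: Gaussian kernels of width sigma. The pairwise sum uses a
    Gaussian of variance 2*sigma^2, because integrating the product of two
    Gaussian kernels yields a Gaussian with doubled variance.
    """
    n = len(samples)
    var = 2.0 * sigma ** 2
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)
    potential = 0.0
    for a in samples:
        for b in samples:
            potential += norm * math.exp(-((a - b) ** 2) / (2.0 * var))
    potential /= n * n
    return -math.log(potential)

def approx_mutual_information(features, labels, sigma):
    """Approximate I(Y;C) as H(Y) - sum_c P(c) * H(Y | C = c).

    Assumption: each entropy is replaced by the Renyi estimate above,
    so this is an approximation rather than Shannon mutual information.
    """
    n = len(features)
    h_y = renyi_quadratic_entropy(features, sigma)
    h_cond = 0.0
    for c in set(labels):
        members = [y for y, l in zip(features, labels) if l == c]
        h_cond += (len(members) / n) * renyi_quadratic_entropy(members, sigma)
    return h_y - h_cond

# Toy data (hypothetical): the same feature values, assigned to classes
# in a separable way and in an interleaved way.
labels = [0, 0, 0, 1, 1, 1]
separated = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
mixed = [-2.1, 2.0, -1.9, 1.9, -2.0, 2.1]
sigma = 0.5

mi_separated = approx_mutual_information(separated, labels, sigma)
mi_mixed = approx_mutual_information(mixed, labels, sigma)
print(mi_separated, mi_mixed)
```

In this sketch the score is higher when the extractor output separates the classes than when the classes are interleaved. A training procedure along the lines described in the abstract would adjust the extractor's parameters to increase such a score, without reference to any particular classifier.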

Index Terms:
Feature extraction, information theory, classification, nonparametric statistics.
Kenneth E. Hild, Deniz Erdogmus, Kari Torkkola, Jose C. Principe, "Feature Extraction Using Information-Theoretic Learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1385-1392, Sept. 2006, doi:10.1109/TPAMI.2006.186